Will multithreading improve performance significantly if I have a fixed number of calculations that are independent of each other?

I am programming a raycasting game engine. Each ray can be calculated without knowing anything about the other rays (I only calculate distances). Since there is no waiting between calculations, I wonder whether it is worth doing the ray calculations multithreaded or not. Is a performance improvement likely?

+6
4 answers

Multithreading is likely to improve performance if done correctly. As you state the problem, it is an ideal candidate for multithreading: the calculations are independent, which minimizes the need for coordination between threads.

Some reasons you may not get a speedup, or not the full speedup you expect, include:

1) The bottleneck may not be the per-core CPU execution resources (for example, ALU-bound operations), but rather something shared, such as memory bandwidth or the shared bandwidth of the last-level cache (LLC).

For example, on some architectures a single thread can saturate the memory bandwidth, so adding more cores will not help. The more general case is that one core can saturate a fraction 1/N < 1 of the main memory bandwidth, and this fraction is greater than 1/C, where C is the core count. For example, on a quad-core system, one core might consume 50% of the bandwidth. A memory-bound calculation will then scale well up to 2 cores (using 100% of the bandwidth), but almost not at all beyond that.

Other resources shared between cores include I/O, the GPU, snoop bandwidth, etc. On a hyper-threaded platform the list grows further, to include all cache levels and the ALU resources shared between the logical cores of each physical core.

2) Contention "in practice" between operations that are "theoretically independent."

You note that your operations are independent. Usually this means they are logically independent: they share no data (except, perhaps, immutable input data) and can write to separate output areas. However, this does not rule out the possibility that a particular implementation has some hidden sharing.

One classic example is false sharing, where logically independent variables fall into the same cache line, so writes to different variables from different threads bounce that cache line back and forth between the cores.

Another common practical example is contention through a library: if your routines make heavy use of malloc, you may find that all threads spend most of their time waiting on a lock inside the allocator, since malloc is a shared resource. This can be mitigated by reducing reliance on malloc (for example, via fewer, larger allocations) or by using a concurrency-friendly allocator such as hoard or tcmalloc.

3) The cost of distributing the computation across threads and collecting the results can outweigh the benefit of using multiple threads. For example, if you create a new thread for each individual ray, the thread-creation overhead will dominate your runtime and you will likely see a net slowdown. Even with a fixed-size thread pool, choosing a "unit of work" that is too fine-grained imposes a lot of coordination overhead, which may cancel out your gains.

Similarly, if you need to copy inputs to and outputs from the worker threads, you may not see the scaling you expect. Where possible, pass read-only data by reference.

4) You have only 1 core, or you have more than 1 core but they are already busy running other threads or processes. In these cases, the effort of coordinating multiple threads is pure overhead.

+10

Perhaps yes: multithreading (e.g. pthreads) can improve performance, but you certainly want to benchmark (and you might be disappointed if your program is memory-bound rather than CPU-bound). You could also consider OpenCL (to run regular numerical computations on a GPGPU) and OpenMP (to explicitly ask the compiler, through pragmas, to parallelize part of your code). You might even consider Open MPI, to run several message-passing processes. And if you're brave (or crazy), you can mix several approaches.

In reality, it depends on the algorithm and the system (both hardware and operating system), and you should benchmark (for example, with a micro-benchmark representative of your workload).

If memory (rather than the CPU) is the bottleneck on a particular system, multithreading or multiprocessing will not help much (and may even degrade performance). Also, the cost of synchronization can vary greatly (for example, locking a mutex can be very fast on some systems and 50% slower on others).

+2

In general, it depends. Given that the calculations are independent, this sounds like a good candidate for improved performance through threading; per-ray calculations in particular can typically benefit from it.

However, many other factors, such as memory access patterns and the underlying system it runs on, will have a huge impact. It is common for multi-threaded versions to run slower than single-threaded ones when not written carefully, so profiling is the only way to answer this question definitively.

+2

Most likely. Independent computations are an ideal candidate for parallelization. In the case of raycasting there are so many of them that they can be spread evenly across as many parallel threads as the hardware allows.

An unexpected bottleneck for calculations that otherwise have perfect data independence can be simultaneous writes to nearby memory locations (false sharing of cache lines).

+1

Source: https://habr.com/ru/post/974000/
