Multithreading is likely to improve performance if done correctly. As you describe it, your problem is an ideal candidate for multithreading, since the calculations are independent, which minimizes the need for coordination between threads.
Some reasons you may not get a speedup, or may not get the full speedup you expect, include:
1) The bottleneck may not be the per-core execution resources (e.g., the ALUs), but rather something shared between cores, such as main-memory bandwidth or the overall bandwidth of the LLC.
For example, on some architectures a single thread can saturate the memory bandwidth, so adding more cores will not help at all. A more general case is that one core can saturate a fraction 1/N < 1 of the main-memory bandwidth, and this fraction is greater than 1/C, where C is the core count. For example, on a quad-core where one core can consume 50% of the bandwidth, a memory-bound calculation will scale well up to 2 cores (at which point 100% of the bandwidth is used), but show almost no improvement beyond that.
Other resources shared between cores include I/O, the GPU, snoop (cache-coherence) bandwidth, etc. On a hyper-threaded platform the list grows further: the logical cores of a physical core also share all the cache levels and the ALU resources of that core.
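If you suspect a bandwidth ceiling, one quick experiment is to time a trivially parallel, memory-bound loop at 1, 2, 4, ... threads and watch where the speedup flattens. A minimal C++ sketch (the array size is illustrative, and `hardware_concurrency` is just a convenient upper bound, not a recommendation):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    // 64M doubles (~512 MB): much larger than any LLC, so the loop streams
    // from main memory.
    const std::size_t n = std::size_t(1) << 26;
    std::vector<double> data(n, 1.0);

    const unsigned max_threads = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned threads = 1; threads <= max_threads; threads *= 2) {
        std::vector<double> partial(threads, 0.0);
        const auto t0 = std::chrono::steady_clock::now();

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < threads; ++t) {
            pool.emplace_back([&, t] {
                // Each thread streams over its own contiguous slice; the work
                // is trivially parallel, so a lack of scaling here points at a
                // shared resource such as memory bandwidth.
                const std::size_t begin = n * t / threads;
                const std::size_t end   = n * (t + 1) / threads;
                partial[t] = std::accumulate(data.begin() + begin,
                                             data.begin() + end, 0.0);
            });
        }
        for (auto& th : pool) th.join();

        const double seconds = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
        std::printf("%2u thread(s): %.3f s (sum = %.0f)\n", threads, seconds,
                    std::accumulate(partial.begin(), partial.end(), 0.0));
    }
}
```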
2) Contention in practice between operations that are independent in theory.
You note that your operations are independent. Usually this means they are logically independent: they share no data (except perhaps immutable data) and they write to separate output areas. However, that does not preclude a particular implementation from introducing some hidden sharing.
One classic example is false sharing, where independent variables fall into the same cache line, so logically independent writes to different variables from different threads cause the cache line to bounce back and forth between the cores.
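Here is a minimal sketch of the effect (the counter layout and iteration count are illustrative): two logically independent counters, first sharing a cache line, then padded onto separate lines.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Unpadded { std::atomic<long> a{0}, b{0}; };   // a and b share a cache line
struct Padded {
    alignas(64) std::atomic<long> a{0};              // each counter gets its
    alignas(64) std::atomic<long> b{0};              // own 64-byte line
};

template <typename Counters>
double hammer(Counters& c) {
    const auto t0 = std::chrono::steady_clock::now();
    // Two threads, each touching only "its" counter -- logically independent.
    std::thread t1([&] { for (long i = 0; i < 50'000'000; ++i) ++c.a; });
    std::thread t2([&] { for (long i = 0; i < 50'000'000; ++i) ++c.b; });
    t1.join(); t2.join();
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
}

int main() {
    Unpadded u;
    Padded p;
    std::printf("shared line: %.3f s\n", hammer(u));  // line ping-pongs between cores
    std::printf("padded:      %.3f s\n", hammer(p));  // typically several times faster
}
```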
Another common example is contention through a library: if your routines make heavy use of malloc, you may find that all threads spend most of their time waiting on a lock inside the allocator, since malloc is a shared resource. This can be mitigated by reducing your reliance on malloc (for example, by making fewer, larger allocations) or by switching to a good parallel allocator such as Hoard or tcmalloc.
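A minimal sketch of the "fewer, larger allocations" idea, assuming a hypothetical `process_item` routine that needs temporary storage: give each worker one scratch buffer allocated up front and reused across work items, instead of calling malloc/new inside the hot loop.

```cpp
#include <thread>
#include <vector>

// Placeholder for a routine that needs per-item temporary storage.
void process_item(int item, std::vector<double>& scratch) {
    scratch.clear();              // reuses existing capacity; no new allocation
    scratch.resize(1024, item);   // grows at most once, then stays warm
    // ... real computation on scratch ...
}

int main() {
    const int num_threads = 4, items_per_thread = 100000;
    std::vector<std::thread> pool;
    for (int t = 0; t < num_threads; ++t) {
        pool.emplace_back([=] {
            std::vector<double> scratch;   // one allocation per thread,
            scratch.reserve(1024);         // not one per work item
            for (int i = 0; i < items_per_thread; ++i)
                process_item(i, scratch);
        });
    }
    for (auto& th : pool) th.join();
}
```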
3) The overhead of distributing the computation across threads and collecting the results can exceed the benefit you get from multiple threads. For example, if you create a new thread for each individual ray, the thread-creation overhead will dominate your runtime, and you will likely see a net slowdown. Even with a fixed-size thread pool, choosing a "unit of work" that is too fine-grained imposes a lot of coordination overhead, which may wipe out your gains.
Similarly, if you need to copy input data into and out of the worker threads, you may not see the scaling you expect. Where possible, pass read-only data by reference rather than copying it.
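Both points combine in a sketch like the following for a ray-tracer-like workload (all the names here, `Scene`, `Color`, `trace_ray`, `render`, are placeholders, not your code): a fixed set of threads, one coarse band of rows per thread, with the read-only scene captured by reference rather than copied.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

struct Scene { /* read-only geometry, materials, ... */ };
struct Color { float r, g, b; };

// Stand-in for the real per-ray computation.
Color trace_ray(const Scene&, int x, int y) {
    return { float(x), float(y), 0.0f };
}

void render(const Scene& scene, int width, int height, std::vector<Color>& image) {
    const unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back([&, t] {   // scene and image captured by reference, not copied
            // One contiguous band of rows per thread: a coarse unit of work,
            // rather than one thread (or one queue entry) per ray.
            const int begin = int(std::size_t(height) * t / num_threads);
            const int end   = int(std::size_t(height) * (t + 1) / num_threads);
            for (int y = begin; y < end; ++y)
                for (int x = 0; x < width; ++x)
                    image[std::size_t(y) * width + x] = trace_ray(scene, x, y);
        });
    }
    for (auto& th : pool) th.join();
}

int main() {
    Scene scene;
    const int width = 640, height = 480;
    std::vector<Color> image(std::size_t(width) * height);
    render(scene, width, height, image);
}
```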
4) You have only 1 core, or you have more than 1 core but they are already busy running other threads or processes. In these cases, the effort of coordinating multiple threads is pure overhead.
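Before tuning anything else, it is worth confirming how many hardware threads the machine actually exposes, e.g.:

```cpp
#include <cstdio>
#include <thread>

int main() {
    // May return 0 if the count cannot be determined.
    unsigned n = std::thread::hardware_concurrency();
    std::printf("hardware threads: %u\n", n);
}
```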