I implemented the Barnes-Hutt gravity algorithm in C as follows:
- Build a tree of clustered stars.
- For each star, cross a tree and apply gravitational forces from each applicable node.
- Update speed and position of stars.
Stage 2 is the most expensive stage, and therefore it is implemented in parallel, dividing many stars. For instance. with 1000 stars and two threads, I have one thread processing the first 500 stars, and the second thread processes the second 500.
In practice, this works: it speeds up the calculation by about 30% with two threads on a dual-core machine compared to the version without streaming. In addition, it gives the same numerical results as the original version without streaming.
My concern is that two threads are simultaneously accessing the same resource (namely, the tree). I did not add synchronization with workflows, so probably at some point they will try to read from one place. Although access to the tree is strictly for reading, I am not 100% sure of security. It worked when I tested it, but I know that this is not a guarantee of correctness!
Questions
- Do I need to make a private copy of the tree for each thread?
- Even if it's safe, are there any performance issues to access the same memory from multiple threads?
Update Test Results for the curious:
Machine: Intel Atom N270 @ 1.60 GHz processor, 800 MHz processor, 512 KB cache
Threads real user sys 0 69.056 67.324 1.720 1 76.821 66.268 5.296 2 50.272 63.608 10.585 3 55.510 55.907 13.169 4 49.789 43.291 29.838 5 54.245 41.423 31.094
0 means no threads; 1 and above means that many workflows and the main thread are waiting for them. I would not expect significant improvements for anything other than two threads, since it is completely connected with the processor and how many cores there are. Interestingly, the odd number of threads is slightly worse than the even number.
Looking at sys , it is obvious that there are costs for creating threads. It currently creates threads for each frame (so N * 1000 creates threads). It was easy to program (for my 15 minutes on the train this morning). I need to think a bit about how to reuse threads ...
Update # 2 . I used a thread pool synchronized with two barriers. This does not have a noticeable performance advantage over recreating threads in each frame.