Unfortunately, current multicore systems are poorly suited to such fine-grained inner-loop parallelism. This is not because of thread creation or forking. As Itiaks noted, almost all OpenMP implementations use thread pools, i.e. they create a number of threads up front and park them while they are idle. So there is effectively no overhead for creating threads.
However, parallelizing an inner loop like this incurs two kinds of overhead:
- Dispatching jobs/tasks to threads: even if we don't need to physically create threads, we must at least assign jobs (= create logical tasks) to threads, which generally requires synchronization.
- Joining threads: once all threads in a team have finished their share of the work, they must be joined (unless the OpenMP nowait clause is used). This is usually implemented as a barrier, which is also a synchronization-heavy operation.
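To make the two costs concrete, here is a minimal sketch in C. The function name `add_rows_hoisted` and the row-major array layout are assumptions made for illustration only. The parallel region is opened once around the outer loop, so the team is not re-dispatched for every row, and `nowait` drops the implicit barrier at the end of each inner loop. Skipping that barrier is only safe here because every row touches different elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: add `delta` to every element of a rows x cols
   array stored in row-major order. */
void add_rows_hoisted(double *a, long rows, long cols, double delta)
{
    assert(a != NULL);

    /* One parallel region for the whole nest: the team is set up once. */
    #pragma omp parallel
    for (long i = 0; i < rows; ++i) {   /* i is private: declared in the region */
        /* The inner iterations are shared among the team. nowait removes
           the implicit barrier that would otherwise end every inner loop;
           this is valid only because rows do not depend on each other. */
        #pragma omp for nowait
        for (long j = 0; j < cols; ++j)
            a[i * cols + j] += delta;
    }
    /* The implicit barrier at the end of the parallel region still joins
       the team once, before the function returns. */
}
```

Compiled without OpenMP support, the pragmas are ignored and the function simply runs serially, so the result is the same either way.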
Therefore, you should minimize the number of times threads are dispatched and joined. You can reduce this overhead by increasing the amount of work the inner loop does per invocation. This can be done with code changes that restructure the loop, such as loop unrolling.
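As a rough illustration of the difference, the sketch below contrasts the two granularities. It uses a different restructuring from the one named above, namely parallelizing the outer loop instead of the inner one, which is possible only when the outer iterations are independent. The function names and the scaling kernel are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fine-grained: a parallel loop is started once per outer iteration, so
   the dispatch cost and the join barrier are paid `rows` times. */
void scale_inner_parallel(double *a, long rows, long cols, double k)
{
    assert(a != NULL);
    for (long i = 0; i < rows; ++i) {
        #pragma omp parallel for
        for (long j = 0; j < cols; ++j)
            a[i * cols + j] *= k;
    }
}

/* Coarse-grained: a single parallel loop over rows, so threads are
   dispatched and joined once, and each thread processes whole rows. */
void scale_outer_parallel(double *a, long rows, long cols, double k)
{
    assert(a != NULL);
    #pragma omp parallel for
    for (long i = 0; i < rows; ++i)
        for (long j = 0; j < cols; ++j)
            a[i * cols + j] *= k;
}
```

Both functions compute the same result. The second one pays the synchronization cost once instead of once per row, which matters most when `cols` is small.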