OpenMP: Good Strategies for Depth-First Search

I am writing a C++ program that brute-force searches for closed knight's tours. The code is here.

I would like to parallelize this using OpenMP. My problem is doing so in a way that creates a sufficient degree of parallelism. Currently, the relevant part of my code looks like this:

    #pragma omp parallel for reduction(+:count) if (depth==4)
    for (size_t i = 0; i < g.neighbours[last].size(); i++) {
        auto n = g.neighbours[last][i];
        // See if n can be used to extend or complete the tour

The if (depth==4) is my attempt to make sure that not too many parallel tasks are created, while on the other hand enough are created to keep all processors busy. Setting depth==2 does not change the run time of the program.
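To illustrate what I expect the if clause to do, here is a toy example (illustrative only, not my actual program): when the condition is false, the parallel region is executed by a team of just one thread, so the loop effectively runs sequentially.

    #include <omp.h>
    #include <cstdio>

    // Toy illustration of the OpenMP if clause: with a false condition the
    // region still exists, but only one thread executes the loop.
    void explore_level(int depth, int n_moves) {
        #pragma omp parallel for if (depth == 4)
        for (int i = 0; i < n_moves; i++) {
            printf("depth %d, move %d: thread %d of %d\n",
                   depth, i, omp_get_thread_num(), omp_get_num_threads());
        }
    }

    int main() {
        explore_level(2, 4);  // team of one thread: always "thread 0 of 1"
        explore_level(4, 4);  // full team: several different thread ids
    }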

This does not seem to work. For the 3x12 problem on my dual-core processor, the total processor time consumed by the OpenMP version is about 130 seconds, while the single-threaded version without OpenMP takes about 40 seconds of processor time.

I would appreciate suggestions on how best to use OpenMP or the reasons why it is not suitable for this problem.

UPDATE: Thanks to @Zulan, I have a newer version using OpenMP tasks with much faster sequential performance as well as good parallelization.

1 answer

First, let's see where your application spends its time using perf:

    perf record ./knights_tour_omp 3 12
    perf report

    # Overhead  Command          Shared Object      Symbol
    # ........  ...............  .................  ..............................
    #
        53.29%  knights_tour_om  libc-2.19.so       [.] malloc
        23.16%  knights_tour_om  libc-2.19.so       [.] _int_free
        10.31%  knights_tour_om  libc-2.19.so       [.] _int_malloc
         4.78%  knights_tour_om  knights_tour_omp   [.] _Z13continue_tourIZ4mainEUlRKSt6vectorIiSaIiEEE_EmRK5GraphS4_RKS0_IcSaIcEEiiRT_.constprop.119
         2.64%  knights_tour_om  libc-2.19.so       [.] __memmove_ssse3_back
         1.48%  knights_tour_om  libc-2.19.so       [.] malloc_consolidate

Your application spends all of its time allocating and freeing memory. While there are reports that malloc is not fully locked, it does not seem to be nicely parallelized either.

No further investigation is needed to figure out that this is caused by copying the visited and path vectors for each iteration. Fortunately, DFS has a nice property: you don't have to do this; you can reuse the state and restore it:

    visited[n] = 1;            // mark the square as taken
    path.emplace_back(n);      // extend the tour
    count += continue_tour(g, path, visited, depth+1, remain-1, cb);
    visited[n] = 0;            // undo both changes when the recursion returns
    path.pop_back();

However, for the parallel loop you need working copies for each thread, so you need a separate code path for this case that keeps the original copying behavior.
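For illustration, a minimal sketch of how the two behaviors could live side by side in the loop from the question (the names follow the snippets above; the exact surrounding code is an assumption, not your actual program):

    #pragma omp parallel for reduction(+:count) if (depth == 4)
    for (size_t i = 0; i < g.neighbours[last].size(); i++) {
        auto n = g.neighbours[last][i];
        if (visited[n]) continue;

        if (depth == 4) {
            // Parallel level: every iteration works on its own copies,
            // so the threads never write to the shared vectors.
            auto local_path = path;
            auto local_visited = visited;
            local_visited[n] = 1;
            local_path.emplace_back(n);
            count += continue_tour(g, local_path, local_visited,
                                   depth + 1, remain - 1, cb);
        } else {
            // Sequential levels (the if clause gives a team of one thread):
            // mutate the shared state in place and restore it afterwards.
            visited[n] = 1;
            path.emplace_back(n);
            count += continue_tour(g, path, visited, depth + 1, remain - 1, cb);
            visited[n] = 0;
            path.pop_back();
        }
    }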

This small change already brings the runtime down from 22 s to 2.65 s. With two threads it further decreases to 2.0 s. There is a speedup, but it's not very good - why? To illustrate this, I used 4 threads (on a sufficiently large system). Here is the application's execution over time, recorded with Score-P and shown in Vampir.

[Vampir timeline: per-thread activity over time]

Red is actual work, blue is the implicit barrier, i.e. threads idling. There always seem to be either 3 or 1 active threads. The reason is simple: most positions on the board have either 4 or 2 neighbours, but one of them has already been visited, so that loop iteration completes instantly. So the actual work is in either 3 or 1 of the loop iterations. That is a particularly bad match for 2 threads. With 3 threads, the runtime goes down to 1.7 s.

Note: all times with gcc 5.3 on an E5-2690 @ 2.90 GHz with turbo disabled. No care was taken to compensate for run-to-run variance.

Given how poor the parallelism in this loop is for this problem, you might be tempted to nest parallel loops. I encourage you to try it, but I don't think that will work out well. I would think that tasks are a good fit in this context, because tasks can spawn other tasks. That way you can spawn each recursive step as a task and let the OpenMP runtime figure out a good schedule. But make sure to limit it to a certain depth so that the tasks do not get too short.
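As an illustration of that idea, here is a minimal sketch (the Graph stand-in, the closed-tour check, the cutoff constant, the atomic accumulation, and the simplified signature without the callback are all my assumptions, not your code):

    #include <algorithm>
    #include <vector>

    // Stand-in for the poster's own graph type.
    struct Graph { std::vector<std::vector<int>> neighbours; };

    constexpr int kTaskDepthLimit = 4;  // assumed cutoff; tune for the machine

    // A tour is closed if the last square is a knight's move away from the first.
    bool closes_tour(const Graph& g, const std::vector<int>& path) {
        const auto& nb = g.neighbours[path.back()];
        return std::find(nb.begin(), nb.end(), path.front()) != nb.end();
    }

    long long continue_tour(const Graph& g, std::vector<int>& path,
                            std::vector<char>& visited, int depth, int remain) {
        if (remain == 0)
            return closes_tour(g, path) ? 1 : 0;

        long long count = 0;
        const int last = path.back();

        for (int n : g.neighbours[last]) {
            if (visited[n]) continue;

            if (depth < kTaskDepthLimit) {
                // Shallow levels: hand each candidate move to a task that gets
                // its own copies of the state, so siblings can run concurrently.
                auto next_path = path;
                auto next_visited = visited;
                next_visited[n] = 1;
                next_path.push_back(n);
                #pragma omp task firstprivate(next_path, next_visited) shared(g, count)
                {
                    long long sub = continue_tour(g, next_path, next_visited,
                                                  depth + 1, remain - 1);
                    #pragma omp atomic
                    count += sub;
                }
            } else {
                // Deep levels: stay sequential, mutate in place and restore.
                visited[n] = 1;
                path.push_back(n);
                count += continue_tour(g, path, visited, depth + 1, remain - 1);
                visited[n] = 0;
                path.pop_back();
            }
        }

        #pragma omp taskwait  // child tasks must finish before count is final
        return count;
    }

    // The recursion is rooted from one thread; the rest of the team
    // picks up the generated tasks from the scheduler.
    long long count_tours(const Graph& g, int start, int n_squares) {
        std::vector<int> path{start};
        std::vector<char> visited(n_squares, 0);
        visited[start] = 1;
        long long count = 0;
        #pragma omp parallel
        #pragma omp single
        count = continue_tour(g, path, visited, /*depth=*/1, /*remain=*/n_squares - 1);
        return count;
    }

The atomic add is cheap here because tasks are only created at shallow depths, so each one carries a substantial amount of work relative to the synchronization.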

I want to emphasize the importance of using tools. Trying to troubleshoot performance issues without performance analysis tools is like trying to find bugs without a debugger. perf is readily available on Linux, and in its basic form it is extremely easy to use. Nevertheless, it is very powerful (although advanced use does have some pitfalls).

An additional note: with OpenMP it is often worth trying different compilers. For instance, with the (fixed) task code, gcc does not scale beyond 4 threads, while the Intel compiler/OpenMP runtime provides a speedup even up to 24 threads.

