First, let's see where your application spends its time, using `perf`:

```
perf record ./knights_tour_omp 3 12
perf report
```
Your application spends essentially all of its time allocating and freeing memory. While `malloc` is reportedly not completely serialized, it does not appear to scale across threads either.
No further digging is needed to see that this comes from copying the `visited` and `path` vectors on every iteration. Fortunately, DFS has the nice property that you don't have to: you can reuse the state and restore it afterwards:
```cpp
visited[n] = 1;
path.emplace_back(n);
count += continue_tour(g, path, visited, depth + 1, remain - 1, cb);
visited[n] = 0;
path.pop_back();
```
However, for the parallel loop you still need working copies for each thread, so keep a separate code path for that case with the original copying behavior.
This small change already brings the runtime down from 22 s to 2.65 s. With two threads it drops to 2.0 s. There is a speedup, but it's not great - why? To illustrate, I ran with 4 threads (on a sufficiently large system). Here is the behavior of the application over time, recorded with Score-P and shown in Vampir.

Red is actual work, blue is the implicit barrier, i.e. threads idling. There always seem to be either 3 or 1 active threads. The reason is simple: most positions on the board have 4 or 2 neighbors, but one of them has already been visited, so that loop iteration finishes instantly. The actual work is therefore in 3 or 1 iterations of the loop. That is a particularly bad fit for 2 threads. With 3 threads, the runtime drops to 1.7 s.
Note: all timings with gcc 5.3 on an E5-2690 @ 2.90 GHz with turbo boost disabled. No care was taken to compensate for run-to-run variance.
Given that this one loop exposes such poor parallelism for this problem, it might be tempting to use nested parallel loops. I encourage you to try it, but I don't think it will perform well. I would expect tasks to work well in this context, because tasks can spawn other tasks. That way you can run each recursive step as a task and let the OpenMP runtime do the scheduling. But be sure to limit this to a certain depth, so that tasks do not become too short.
I want to stress the importance of using tools: trying to figure out performance issues without performance-analysis tools is like trying to figure out bugs without a debugger. `perf` is readily available on Linux and in its basic form extremely simple to use, yet very powerful (although advanced use does have some pitfalls).
A further note: with OpenMP it is often worthwhile to try different compilers. For instance, with the (fixed) task code, gcc does not scale beyond 4 threads, while the Intel compiler and OpenMP runtime show a speedup even up to 24 threads.