Is find_end theoretically parallelizable?

I am currently working on an open-std proposal to bring parallel functionality to a project I am working on, but I have hit a roadblock with find_end.

Now find_end can be described as:

An algorithm that searches for the last occurrence of the subsequence [s_first, s_last) in the range [first, last). The first version uses operator== to compare elements, the second version uses the given binary predicate p.

Its requirements are laid out on cppreference. I had no problem parallelizing find / find_if / find_if_not, etc.: they can easily be divided into separate sections that are searched asynchronously, and I had no problems there. The problem with find_end is that simply splitting the range into pieces is not a solution. Say we have the vector:

1 2 3 4 5 1 2 3 8

and we want to find 1 2 .

Well, my first thought is to split the vector into pieces and search each fragment asynchronously, right? That seems easy enough, but what happens if, for whatever reason, only 3 cores are available, so the vector is divided into 3 parts:

1 2 3 | 4 5 1 | 2 3 8

Now I have a problem: the second occurrence of 1 2 is split across two sections. This will lead to a lot of invalid results whenever someone's x cores end up breaking the searched-for sequence across y different sections. I thought about doing something like: search the chunks -> merge the y chunks into y/2 chunks -> search again -> and so on, recursively, but that seems so inefficient that it defeats the whole point of the algorithm, which is to improve efficiency. I may also be overthinking this.
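
For reference, the straightforward chunked approach that worked for find / find_if looks roughly like the sketch below (simplified and purely illustrative, not the actual proposal code; it assumes random-access iterators and uses std::async for the workers):

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Illustrative only: split [first, last) into num_chunks pieces, search
    // each piece asynchronously, and return the earliest hit.
    template <typename It, typename Pred>
    It chunked_find_if(It first, It last, Pred pred, std::size_t num_chunks)
    {
        const std::ptrdiff_t total = last - first;
        if (total == 0 || num_chunks == 0) return last;
        const std::ptrdiff_t chunk =
            (total + static_cast<std::ptrdiff_t>(num_chunks) - 1)
            / static_cast<std::ptrdiff_t>(num_chunks);

        std::vector<std::future<It>> futures;
        for (std::ptrdiff_t b = 0; b < total; b += chunk) {
            It begin = first + b;
            It end   = first + std::min(total, b + chunk);
            futures.push_back(std::async(std::launch::async, [=] {
                It hit = std::find_if(begin, end, pred);
                return hit != end ? hit : last;   // normalise "not found" to last
            }));
        }

        // Chunks are in order, so the first chunk that reports a hit wins.
        It result = last;
        for (auto& f : futures) {
            It hit = f.get();
            if (result == last && hit != last) result = hit;
        }
        return result;
    }

This works because a single element cannot straddle a chunk boundary; a multi-element subsequence can, which is exactly the problem above.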

tl;dr: is there a way to parallelize find_end that I'm not thinking of?

+6
2 answers

Yes, there is a way.

Let N be the length of the sequence you are searching for (the needle).

Once you have divided your vector into 3 chunks (3 separate worker threads):

 1 2 3|4 5 1|2 3 8 

You can let each thread run past the end of its chunk into the right-adjacent chunk (if there is one) by N-1 elements (since only read operations on the sequence are involved, this is perfectly fine and thread-safe).

In this case (N = 2):

  • Core 1 runs on 1 2 3 4

  • Core 2 runs on 4 5 1 2

  • Core 3 runs on 2 3 8
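
A minimal sketch of this overlapping-chunk idea, assuming random-access iterators and plain std::async workers (overlapped_find_end and the chunk-size handling are illustrative, not a standard API):

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Each worker runs std::find_end over its own chunk extended by N-1
    // elements into the neighbouring chunk, so a needle that straddles a
    // boundary is still seen in full by exactly one worker.
    template <typename It>
    It overlapped_find_end(It first, It last, It s_first, It s_last,
                           std::size_t workers)
    {
        const std::ptrdiff_t n     = s_last - s_first;   // needle length N
        const std::ptrdiff_t total = last - first;
        if (n == 0 || total < n || workers == 0) return last;

        const std::ptrdiff_t chunk =
            (total + static_cast<std::ptrdiff_t>(workers) - 1)
            / static_cast<std::ptrdiff_t>(workers);

        std::vector<std::future<It>> futures;
        for (std::ptrdiff_t b = 0; b < total; b += chunk) {
            It begin = first + b;
            It end   = first + std::min(total, b + chunk + n - 1); // overlap by N-1
            futures.push_back(std::async(std::launch::async, [=] {
                It hit = std::find_end(begin, end, s_first, s_last);
                return hit != end ? hit : last;   // normalise "not found" to last
            }));
        }

        // Match start positions reported by chunk i lie strictly before those
        // of chunk i+1, so the last chunk with a hit holds the global answer.
        It result = last;
        for (auto& f : futures) {
            It hit = f.get();
            if (hit != last) result = hit;
        }
        return result;
    }

Because the extension is only N-1 elements, every possible match start position belongs to exactly one worker, so the workers' results are disjoint and ordered and combining them is trivial.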

+6

Since the point of find_end is to find the last occurrence of the needle in the haystack, parallelization by splitting the haystack into contiguous segments often brings no benefit: if the needle actually is in the last segment, the work performed by every processor other than the one assigned to that last segment is wasted, and the elapsed time is exactly the same as with a single processor. In theory, the parallel evaluation does put a bound on the maximum search time, which is beneficial provided that (1) the processors are not competing with other tasks and (2) there are relatively few instances of the needle in the haystack.

In addition, you need to be able to coordinate termination of the processes. If the segments are numbered from the end of the haystack (so process 0 gets the last segment), each process can abandon its search when it finds a match, or when its lower-numbered neighbour has either found a match or abandoned its own search. Once process 0 has either found a match or run out of places to look, the lowest-numbered process that holds a match wins.

An alternative is to interleave the searches. If you have k processors, processor 0 is given the candidate sequences ending at last-0, last-k, last-2k, ...; processor 1 is given the candidate sequences ending at last-1, last-k-1, last-2k-1, ...; and in general processor i (0 ≤ i < k) examines the sequences ending at last-i, last-k-i, last-2k-i, ...

Coordinating termination is slightly different from the first alternative. Again, any individual process can stop as soon as it finds a match. In addition, any process can stop as soon as its current candidate position is earlier than the latest match already found by another process.
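
A rough sketch of this interleaved scheme, working on indices into a std::vector and using an atomic "best match so far" for the cooperative termination described above (interleaved_find_end and the threading details are illustrative only):

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Worker i tests candidate start positions (size - n) - i, then -k, -2k, ...
    // Returns the index of the last occurrence, or hay.size() if none exists.
    template <typename T>
    std::size_t interleaved_find_end(const std::vector<T>& hay,
                                     const std::vector<T>& needle,
                                     std::size_t k /* number of workers */)
    {
        const std::size_t n = needle.size();
        if (n == 0 || hay.size() < n || k == 0) return hay.size();

        const std::ptrdiff_t top = static_cast<std::ptrdiff_t>(hay.size() - n);
        std::atomic<std::ptrdiff_t> best(-1);    // right-most match start so far

        auto worker = [&](std::size_t i) {
            for (std::ptrdiff_t pos = top - static_cast<std::ptrdiff_t>(i);
                 pos >= 0;
                 pos -= static_cast<std::ptrdiff_t>(k))
            {
                if (pos <= best.load(std::memory_order_relaxed))
                    return;                       // a later match already exists
                if (std::equal(needle.begin(), needle.end(), hay.begin() + pos)) {
                    std::ptrdiff_t cur = best.load(std::memory_order_relaxed);
                    while (pos > cur && !best.compare_exchange_weak(cur, pos)) {
                        // cur is refreshed on failure; retry only while we improve
                    }
                    return;                       // remaining candidates are all earlier
                }
            }
        };

        std::vector<std::thread> pool;
        for (std::size_t i = 0; i < k; ++i) pool.emplace_back(worker, i);
        for (auto& t : pool) t.join();

        const std::ptrdiff_t found = best.load();
        return found < 0 ? hay.size() : static_cast<std::size_t>(found);
    }

Each worker scans its own candidates from right to left, so the first match it finds is its best one, and the global winner is simply the maximum of the published positions.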

Although this should yield a reasonable parallelization of the search, it is not clear to me that it will do better than a non-parallelized linear-time algorithm such as Knuth-Morris-Pratt or Boyer-Moore, either of which can be trivially modified to search from right to left. Those algorithms are particularly useful in the case where the needle is a compile-time constant, which allows the necessary shift tables to be precomputed. The segmented (non-interleaved) parallelization could also use KMP or BM within each segment, with the same caveat as above: most of the work done by the processes is likely to be wasted.
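
As an aside, one cheap way to get a right-to-left sublinear search without hand-writing KMP or BM is to run C++17's std::boyer_moore_searcher over the reversed haystack with the reversed needle; the first hit in the reversed view is the last occurrence in the original range. This is only a sketch of that trick, not the modification described above, and it requires std::hash and operator== for the element type:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    template <typename T>
    typename std::vector<T>::const_iterator
    last_occurrence(const std::vector<T>& hay, const std::vector<T>& needle)
    {
        if (needle.empty() || hay.size() < needle.size()) return hay.end();

        auto rhit = std::search(hay.rbegin(), hay.rend(),
                                std::boyer_moore_searcher(needle.rbegin(),
                                                          needle.rend()));
        if (rhit == hay.rend()) return hay.end();

        // Map the reverse-iterator hit back to a forward iterator pointing at
        // the first element of that (last) occurrence in the original range.
        return (rhit + static_cast<std::ptrdiff_t>(needle.size())).base();
    }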

+2
