Parallelization of the quadratic algorithm

Let's say I have a vector (array, list, any ...) V of N elements, let V 0 - V (N-1) . For each element V i, the function f (V i , V j ) must be calculated for each index j (including the case i = J). The function is symmetric, so when f (V i , V j ) is computed, there is no need to reconfigure f (V j , B <sub> Isub>). Then we have N (N + 1) / 2 complete function estimates, making this algorithm O (N 2 ). Assume that the time taken to compute f is relatively long but consistent.

Now I want to parallelize the execution of the algorithm. I need to define a schedule for workflows (some M numbers) so that two threads do not use the same piece of memory (i.e. the same item) at the same time. For example, f (V 1 , V 2 ) can be estimated parallel to f (V 3 , V 4 ), but not parallel to f (V 2 , V 3 ). The workflow is divided into steps such that for each step, each workflow performs one estimate of f. Then the threads are synchronized, after which they move on to the next step, etc.

Question: how to determine (preferably optimally) the schedule for each thread in the form of a series of pairs of indices (i, j) to solve the whole problem (i.e. each pair of indices visited exactly once, given the symmetry)? Of course, a direct answer would be nice, I would also appreciate a pointer to an algorithm, or even to relevant websites / literature.
+6
source share
3 answers

This reminds me of the general problem of planning sporting events: in a league with N teams, organize N-1 gamedays so that each team has one game per day and plays each other team once.

Playing chess, there is a rather illustrative solution to this problem. Arrange all the boards side by side on a long table. One player always stays in one chair. Other players spin around the table in the same direction, skipping this player.

+5
source

Let's look at a direct implementation:

  for (i = 0; i <N; ++ i)
 {
     for (j = 1; j <i; ++ J)
     {
         compute i, j pair and assign to i, j, and j, i result
     }
 }

I am a C ++ programmer, so I can think of parallelism parallelism, say, using OpenMP:

  #pragma OMP parallel for
 for (i = 0; i <N; ++ i)
 {
     ....
 }

If you do not know OpenMP, it simply divides the cycle into n cycles according to the number of processors and runs them in parallel. I do not think that in this case the result will be good, because each cycle I + 1 is shorter than the cycle. Let this algorithm write in such a way that all loops have the same length. The total number of cycles will be N / 2. The first loop processes lines 0 and N-1. The second cycle processes lines 1 and N-2, etc. Such a cycle can be successfully parallelized, without conflicts. Simplified code without details about even / odd N, etc .:

  #pragma OMP parallel for
 for (i = 0; i <N / 2; ++ i)
 {
     Handle i value
     ...

     Handle N - 1 - i value
     ...
 }

This is just a general idea, there are wrong details that you can fix.

0
source

I see two possibilities. One of them is to divide the tasks K = N (N + 1) / 2 among M apriori threads. Below is just the pseudo code:

allocateTasks() { num = ceil(N*(N+1)/2/M); // number of tasks per thread i0 = 0, j0 = 0; n = 0; for (i = 0; i < N; ++i) { for (j = i; j < N; ++j) { // skip symmetries if (n == 0) {i0 = i; j0 = j;} ++n; if (n == num) { thread(i0, j0, num); n = 0; } } } if (n > 0) thread(i0, j0, n); } thread(i0, j0, num) { // perform num tasks starting at (i0, j0) i = i0; j = j0; for (n = 0; n < num; ++n) { // f[i, j] = ... ++j; if (j >= N) {++i; j = i;} } } 

The problem is that when the calculation f (i, j) does not take the same time, then the load is not balanced between the threads.

So perhaps the second approach is better. Each thread takes the next available task. There is no global pre-distribution of tasks.

 // init boolean array: done[0..N-1, 0..N-1] = false thread() { for (i = 0; i < N; ++i) { for (j = i; j < N; ++j) { // skip symmetries boolean mytask = false; lock(mutex); // synchronize access to global array done[][] if (!done[i, j]) { done[i, j] = true; mytask = true; } unlock(mutex); if (mytask) { f(i, j) = ... } } } } 

In any case, access to f (i, j) will be through

 g(i, j) = j >= i ? f(i, j) : f(j, i) 
0
source

Source: https://habr.com/ru/post/898269/


All Articles