I see two possibilities. One is to divide the K = N(N+1)/2 tasks among the M threads a priori. Below is pseudocode:
```
allocateTasks() {
    num = ceil(N * (N + 1) / 2 / M);   // number of tasks per thread
    i0 = 0; j0 = 0;
    n = 0;
    for (i = 0; i < N; ++i) {
        for (j = i; j < N; ++j) {      // skip symmetries
            if (n == 0) { i0 = i; j0 = j; }
            ++n;
            if (n == num) {
                thread(i0, j0, num);
                n = 0;
            }
        }
    }
    if (n > 0) thread(i0, j0, n);
}

thread(i0, j0, num) {   // perform num tasks starting at (i0, j0)
    i = i0; j = j0;
    for (n = 0; n < num; ++n) {
        // f[i, j] = ...
        ++j;
        if (j >= N) { ++i; j = i; }
    }
}
```
The problem is that if computing f(i, j) does not take the same time for every pair, the load is not balanced across the threads.
So perhaps the second approach is better. Each thread takes the next available task. There is no global pre-distribution of tasks.
```
// init boolean array: done[0..N-1, 0..N-1] = false

thread() {
    for (i = 0; i < N; ++i) {
        for (j = i; j < N; ++j) {   // skip symmetries
            boolean mytask = false;
            lock(mutex);            // synchronize access to the global done[][] array
            if (!done[i, j]) {
                done[i, j] = true;
                mytask = true;
            }
            unlock(mutex);
            if (mytask) {
                f(i, j) = ...
            }
        }
    }
}
```
In either case, access to f(i, j) should go through a wrapper that maps both (i, j) and (j, i) to the stored upper-triangle entry:
g(i, j) = j >= i ? f(i, j) : f(j, i)