One way to use multithreading for Q2 is to do this in two passes (each of which uses T threads inside, where T can be freely selected):
Calculate c[i] = a[i] + b[i] + c[i-1] for all cells in the normal multithreaded mode, that is, after dividing the input into the ranges [0, r1), [r1, r2) ,. .. [rk, n) and applying one stream for each range. Yes, this will not be true for all ranges except the first, and step 2 will correct it.
Compute corrections again in a multi-threaded way. To do this, we look at the right value of each range, i.e. corr1:=c[r1-1] , corr2:=corr1+c[r2-1] , corr3:=corr2+c[r3-1] etc., which gives us a correction value for each stream, and then calculates again using multithreading with the same ranges as before, c[i] += corrk where corrk is the correction value for the k-th stream. (For a zero thread, we can use corr0:=0 , so the thread should not do anything.)
This improves the theoretical runtime by a factor of T, where T is the number of threads (which can be freely selected), so this is the optimal solution for multithreading.
To illustrate how this works, here is an example where we assume arrays of length n==30 . We also assume that we use 3 threads: one for calculating the range c[0..9] , one for c[10..19] and one for c[20..29] .
It is clear that the goal in the cell c[i] for any 0<i<n obtained
c[i] == a[0]+...+a[i]+b[0]+...+b[i]
(that is, the sum of all a[0..i] and all b[0..i] ) after the completion of the algorithm. Let's see how the algorithm works there, for example cell i==23 . This cell is processed by the third thread, that is, the thread responsible for the range c[20..29] .
Step 1: Stream Sets
c[20] = a[20]+b[20] c[21] = c[20]+a[21]+b[21] == a[20]+a[21]+b[20]+b[21] c[22] = c[21]+a[22]+b[22] == a[20]+a[21]+a[22]+b[20]+b[21]+b[22] c[23] = c[22]+a[23]+b[23] == a[20]+a[21]+a[22]+a[23]+b[20]+b[21]+b[22]+b[23] ...
So, after step 1 is completed, we have some of a[20..23] and b[20..23] in cell c[23] . What is missing is the sum of a[0..19] and b[0..19] .
Similarly, the first and second threads set the values ββin c[0..9] and c[10..19] so that
c[0] = a[0]+b[0] c[1] = c[0]+a[1]+b[1] == a[0]+a[1]+b[0]+b[1] ... c[9] = a[0]+...+a[9]+b[0]+...+b[9]
and
c[10] = a[10]+b[10] ... c[19] = a[10]+...+a[19]+b[10]+...+b[19]
Step 2: The correction value for the third thread corr2 is the sum of corr1 and the rightmost value calculated by the second thread, while corr1 is the rightmost value calculated by the first thread, Therefore
corr2 == c[9]+c[19] == (a[0]+...+a[9]+b[0]+...+b[9]) + (a[10]+...+a[19]+b[10]+...+b[19])
And this is indeed the sum absent in the value of c[23] after step 1. In step 2, we add this value to all elements of c[20..29] , therefore, after completing step 2, c[23] is correct (and all other cells )
The reason this works is that calculating cell values ββis an operation from left to right, strictly incremental, and the order of operations for calculating a single cell does not matter (since + associative and commutative). Therefore, the final result (the "rightmost value") of any given stream after step 1 can be used to correct the results of the streams responsible for the ranges to the right in step 2.