Your code currently has a race condition, so the result is incorrect. To illustrate why, consider a simple example:
You run on 2 threads with the array int input[4] = {1, 2, 3, 4};. You correctly initialize sum to 0 and are ready to start the loop. In the first iteration of the loop, thread 0 and thread 1 both read sum from memory as 0, then each adds its own element to it and writes the result back. That means thread 0 tries to write sum = 1 to memory (the first element is 1, and sum = 0 + 1 = 1), while thread 1 tries to write sum = 2 (the second element is 2, and sum = 0 + 2 = 2). The final result depends on which of the two happens to write last, which is exactly a race condition. Worse, in this particular case none of the answers the code could produce is correct! There are several ways around this; below I describe the three main ones:
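For reference, here is a minimal sketch of the kind of loop that triggers this race (the names input, snum, and sum are assumed to match the snippets below; your original code may differ slightly):

#include <stdio.h>

int main(void)
{
    int input[4] = {1, 2, 3, 4};
    int snum = 4;
    int sum = 0;
    int i;

    /* Every thread reads and writes the shared variable sum with no
       synchronization, so updates can be lost. */
    #pragma omp parallel for schedule(static)
    for(i = 0; i < snum; i++)
    {
        int *tmpsum = input + i;
        sum += *tmpsum;   /* unsynchronized read-modify-write */
    }

    printf("sum = %d\n", sum);   /* often not the correct value 10 */
    return 0;
}

Compile with OpenMP enabled (e.g. gcc -fopenmp); without that flag the pragma is ignored, the loop runs serially, and the result is always 10.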
#pragma omp critical
OpenMP has a critical directive. It restricts the enclosed code so that only one thread can execute it at a time. For example, your for-loop can be written as:
#pragma omp parallel for schedule(static)
for(i = 0; i < snum; i++)
{
    int *tmpsum = input + i;
    #pragma omp critical
    sum += *tmpsum;
}
This eliminates the race condition, since only one thread reads and writes sum at a time. However, the critical directive is very bad for performance and is likely to kill most (if not all) of the gain you get from using OpenMP in the first place.
#pragma omp atomic
The atomic directive is very similar to the critical directive. The main difference is that while critical applies to any block of code you want executed by one thread at a time, atomic applies only to memory read/write operations. Since all we do in this code is read and write sum, this directive works just fine:
#pragma omp parallel for schedule(static)
for(i = 0; i < snum; i++)
{
    int *tmpsum = input + i;
    #pragma omp atomic
    sum += *tmpsum;
}
The performance of atomic is usually significantly better than that of critical. Still, it is not the best option in your particular case.
reduction
The method you should use, and the one already suggested by others, is reduction. You can use it by changing your for-loop to:
#pragma omp parallel for schedule(static) reduction(+:sum)
for(i = 0; i < snum; i++)
{
    int *tmpsum = input + i;
    sum += *tmpsum;
}
The reduction clause tells OpenMP that, while the loop is running, each thread should keep its own private copy of sum, and that these copies should be added together at the end of the loop. This is the most efficient method, since the entire loop now runs in parallel, with the only overhead coming at the very end of the loop, when the per-thread values of sum must be combined.
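For intuition, here is roughly what the reduction clause does behind the scenes, written out by hand (a conceptual sketch only, assuming the same sum, input, snum, and i as in the snippets above):

#pragma omp parallel
{
    int localsum = 0;                 /* private partial sum for this thread */

    #pragma omp for schedule(static)
    for(i = 0; i < snum; i++)
    {
        localsum += input[i];         /* no shared state touched here */
    }

    #pragma omp atomic                /* combine the partial sums, once per thread */
    sum += localsum;
}

Each thread does nearly all of its work on its private localsum, and the shared sum is touched only once per thread, which is why reduction scales so much better than putting critical or atomic inside the loop.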