Parallel Sum for Vectors

Can someone please suggest how I can reduce the execution time of the following for loop through multithreading? Suppose I have two vectors called 'a' and 'b'.

    for (int j = 0; j < 8000; j++){
        // Perform an operation and store in the vector 'a'
        // Add 'a' to 'b' coefficient wise
    }

This for loop runs many times in my program. The two operations in the for loop above are already optimized, but they only work on one core. However, I have 16 cores and would like to use them.

I tried modifying the loop as follows. Instead of a single vector 'a', I have 16 vectors, and the i-th one is called a[i]. My for loop now looks like

    for (int j = 0; j < 500; j++){
        for (int i = 0; i < 16; i++){
            // Perform an operation and store in the vector 'a[i]'
        }
        for (int i = 0; i < 16; i++){
            // Add 'a[i]' to 'b' coefficient wise
        }
    }

I use OpenMP for each of the inner for loops, adding '#pragma omp parallel for' before each one. All my processors are in use, but my runtime has increased significantly. Does anyone have any suggestions on how I can reduce the runtime of this loop? Thanks in advance.

+6
3 answers

OpenMP creates threads wherever you insert a pragma, so you are creating them for the inner loops. The problem is that, with your approach, 16 threads are started, each performs a single operation, and then they are all shut down again, and this happens on every pass through the outer loop. Starting and stopping threads takes a lot of time, which is why your version runs slower even though it uses all 16 cores. You did not need to create the inner for loops at all: just put #pragma omp parallel for before your 8000-iteration loop and leave it to OpenMP to split the iterations among the threads. The second loop you wrote by hand is exactly the job OpenMP does for you. That way OpenMP creates the threads only once, each thread processes 500 iterations, and the threads finish after that (499 fewer rounds of thread creation and destruction).
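The suggestion above, a single parallel region around the full 8000-iteration loop, might look roughly like the sketch below. Several things here are assumptions and not part of the original question: `compute` is a hypothetical stand-in for "perform an operation and store in 'a'", and `accumulate` and `parallel_sum` are illustrative names. One addition beyond the answer's text: if every thread added into the shared 'b' directly, the updates would race, so each thread keeps its own 'a' and its own partial sum and merges the partial sum into 'b' once at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for "perform an operation and store in 'a'".
void compute(int j, std::vector<double>& a) {
    for (std::size_t k = 0; k < a.size(); ++k) {
        a[k] = static_cast<double>(j) + static_cast<double>(k);
    }
}

// Coefficient-wise dst += src.
void accumulate(std::vector<double>& dst, const std::vector<double>& src) {
    assert(dst.size() == src.size());
    for (std::size_t k = 0; k < dst.size(); ++k) {
        dst[k] += src[k];
    }
}

std::vector<double> parallel_sum(int iterations, std::size_t n) {
    std::vector<double> b(n, 0.0);

    // One parallel region for the whole loop: the thread team is set up once.
    #pragma omp parallel
    {
        std::vector<double> a(n);             // private scratch vector per thread
        std::vector<double> local_b(n, 0.0);  // private partial sum per thread

        // OpenMP splits the iterations among the threads.
        #pragma omp for nowait
        for (int j = 0; j < iterations; ++j) {
            compute(j, a);
            accumulate(local_b, a);
        }

        // Merge each thread's partial sum into the shared result, one thread at a time.
        #pragma omp critical
        accumulate(b, local_b);
    }
    return b;
}
```

Calling `parallel_sum(8000, n)` corresponds to the loop in the question. Built with `-fopenmp` the loop runs in parallel; built without it, the pragmas are ignored and the same code runs serially, which is a convenient way to check that both versions agree.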

+5

Actually, I am going to turn these comments into an answer.

Spawning threads for trivial operations simply adds overhead.

First, make sure your compiler uses vector instructions to implement your loop. (If it doesn't know how to do this, you may have to code the vector instructions yourself; try looking up the SSE "intrinsics". But for this kind of simple vector addition, automatic vectorization should be possible.)

Assuming your compiler is a fairly modern GCC, call it with

 gcc -O3 -march=native ... 

Add -ftree-vectorizer-verbose=2 to find out whether it auto-vectorized your loop and, if not, why. (In newer GCC releases this option has been superseded by -fopt-info-vec and -fopt-info-vec-missed.)
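As a minimal sketch of the kind of loop the vectorizer handles well, here is a coefficient-wise addition over contiguous arrays. The function names `add_into` and `add_vectors` are illustrative, and `__restrict` is a GCC/Clang extension in C++ used here on the assumption that the two vectors never overlap in memory; the question does not say how 'a' and 'b' are actually stored.

```cpp
// Example build: g++ -O3 -march=native -fopt-info-vec example.cpp
#include <cassert>
#include <cstddef>
#include <vector>

// b[k] += a[k] for every k. __restrict tells the compiler that a and b do not
// alias, which removes one common obstacle to auto-vectorization.
void add_into(double* __restrict b, const double* __restrict a, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) {
        b[k] += a[k];
    }
}

// Convenience wrapper for std::vector; the caller must pass two distinct vectors.
void add_vectors(std::vector<double>& b, const std::vector<double>& a) {
    assert(a.size() == b.size());
    add_into(b.data(), a.data(), b.size());
}
```

If the compiler report shows that this loop was not vectorized, the reason it gives is usually the place to start looking.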

If you are already using vector instructions, you might be saturating your memory bandwidth. Modern processor cores are pretty fast ... If so, you need to restructure at a higher level to get more operations within each iteration of the loop, finding ways to perform a large number of operations on blocks that fit into the L1 cache.
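One way to picture this restructuring is cache blocking: walk over 'b' in chunks small enough to stay in L1, and do all the iterations for one chunk before moving to the next, so each element of 'b' is loaded from main memory once and not once per iteration. The sketch below rests on a strong assumption that may not hold for the real code, namely that the operation can be evaluated element by element; `op` is a hypothetical stand-in for it, and `blocked_sum` and the default block size are illustrative choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical element-wise operation: the value iteration j contributes to element k.
double op(int j, std::size_t k) {
    return static_cast<double>(j) + static_cast<double>(k);
}

// Computes b[k] = sum over j of op(j, k), one block of b at a time.
// 1024 doubles are 8 KiB, small enough for a typical L1 data cache.
std::vector<double> blocked_sum(int iterations, std::size_t n, std::size_t block = 1024) {
    assert(block > 0);
    std::vector<double> b(n, 0.0);
    for (std::size_t start = 0; start < n; start += block) {
        const std::size_t end = std::min(n, start + block);
        // All iterations touch only b[start..end), which stays cache-resident.
        for (int j = 0; j < iterations; ++j) {
            for (std::size_t k = start; k < end; ++k) {
                b[k] += op(j, k);
            }
        }
    }
    return b;
}
```

Note that this version never stores the intermediate vector 'a' at all; the operation and the addition are fused into one pass. Whether that is possible depends on what the real operation does.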

+3

Does anyone have any suggestions on how I can decrease the runtime of this loop?

    for (int j = 0; j < 500; j++){      // outer loop
        for (int i = 0; i < 16; i++){   // inner loop

Always try to give the outer loop fewer iterations than the inner loop. This saves you from running the inner loop's initialization over and over. In the code above, the inner loop's initialization, i = 0;, is executed 500 times. Now,

    for (int i = 0; i < 16; i++){        // outer loop
        for (int j = 0; j < 500; j++){   // inner loop

Now the inner loop's initialization, j = 0;, is executed only 16 times! Try changing your code accordingly and see whether it makes a difference.

0

Source: https://habr.com/ru/post/889808/

