Cost of OpenMPI in C++

I have the following C++ program that does no communication, and the same, identical work is done on every core, so I know it does not exploit parallelism at all:

    unsigned n = 130000000;
    std::vector<double> vec1(n, 1.0);
    std::vector<double> vec2(n, 1.0);
    double t1, t2, dt;
    t1 = MPI_Wtime();
    for (unsigned i = 0; i < n; i++) {
        // Do something so it is not a trivial loop
        vec1[i] = vec2[i] + i;
    }
    t2 = MPI_Wtime();
    dt = t2 - t1;

I run this program on one node with two Intel® Xeon® E5-2690 v3 processors, so I have 24 cores in total. It is a dedicated node; no one else uses it. Since there is no communication and each process performs the same amount of (identical) work, running it on several processes should take the same time. However, I get the following times (averaged over all cores, in seconds):

 1 core:  0.237
 2 cores: 0.240
 4 cores: 0.241
 8 cores: 0.261
16 cores: 0.454

What can cause this increase in time, especially for 16 cores? I ran callgrind and get roughly the same number of data/instruction cache misses on all cores (the miss rate is the same).

I repeated the same test on a node with two Intel® Xeon® E5-2628L v2 processors (16 cores in total) and observe the same increase in runtime. Does this have anything to do with the MPI implementation?

+5
3 answers

I suspect there are shared resources that your program has to use, so as the number of processes increases, processes have to wait for those resources to become free before they can use them.

You see, you may have 24 cores, but that does not mean the whole system lets every core do everything at the same time. As mentioned in the comments, memory access is one thing that can cause delays (due to traffic), and the same goes for the disk.

Also consider the network connection, which can likewise suffer from many concurrent accesses. In short, these hardware delays can be enough to dominate the processing time.


General note: remember how parallel efficiency is defined:

E = S / p, where S is the speedup and p is the number of nodes/processes/threads.

Now consider scalability. Most programs are only weakly scalable, i.e. you have to grow the problem size and p at the same rate to keep the efficiency constant. A program that keeps its efficiency constant when only p grows, with the problem size (n in your case) held fixed, is strongly scalable.
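As a rough illustration with the timings posted in the question: since every rank repeats the same n, the run is effectively a weak-scaling test, and the corresponding efficiency is simply T_1 / T_p:

    weak-scaling efficiency at p = 16  ≈  T_1 / T_16  =  0.237 / 0.454  ≈  0.52

In other words, about half of the per-core performance is lost once all 16 cores are busy.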

+2

Given that you use ~2 gigabytes of memory per rank, your code is memory bound. Apart from what the prefetchers can hide, you are not working in the cache but in main memory, and you simply hit the memory bandwidth limit at a certain number of active cores.
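A rough back-of-the-envelope estimate (my own numbers, assuming 8-byte doubles and ignoring write-allocate traffic, which would only add to it):

    bytes streamed per rank ≈ 130e6 × (8 B read + 8 B write) ≈ 2.1 GB
    per-rank bandwidth at 0.237 s ≈ 2.1 GB / 0.237 s ≈ 9 GB/s
    16 ranks ≈ 16 × 9 GB/s ≈ 140 GB/s aggregate

That is roughly at or above the ~136 GB/s combined theoretical peak of two E5-2690 v3 sockets (four DDR4-2133 channels each), so once enough ranks are active the per-rank time has to go up.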

Another aspect may be turbo mode, if it is enabled. Turbo mode can raise the core frequency when fewer cores are in use. As long as the memory bandwidth is not saturated, the higher turbo frequency increases the bandwidth each core gets. This document discusses the available aggregate memory bandwidth on Haswell processors depending on the number of active cores and the frequency (Fig. 7/8).

Note that this has nothing to do with MPI/OpenMPI. You would see the same effect running the same program X times by any other means.

+4

Your program does not use parallel processing at all. Compiling it with OpenMP does not by itself make it parallel.

To parallelize the for loop, for example, you need to add the appropriate OpenMP #pragma:

    unsigned n = 130000000;
    std::vector<double> vec1(n, 1.0);
    std::vector<double> vec2(n, 1.0);
    double t1, t2, dt;
    t1 = MPI_Wtime();
    #pragma omp parallel for
    for (unsigned i = 0; i < n; i++) {
        // Do something so it is not a trivial loop
        vec1[i] = vec2[i] + i;
    }
    t2 = MPI_Wtime();
    dt = t2 - t1;

However, keep in mind that for large values of n the effect of cache misses may hide the performance gained from using several cores.
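For completeness, since the question is actually about MPI rather than OpenMP: below is a minimal sketch (my own, with illustrative names and a simple block layout) of how the same loop could instead be split across MPI ranks, so that each rank processes only its own slice of the index range rather than repeating all n iterations.

    #include <mpi.h>
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const unsigned n = 130000000;
        // Give each rank a contiguous chunk of the index range.
        const unsigned chunk = (n + size - 1) / size;
        const unsigned begin = std::min(n, static_cast<unsigned>(rank) * chunk);
        const unsigned end   = std::min(n, begin + chunk);

        // Each rank only allocates and touches its own slice.
        std::vector<double> vec1(end - begin, 1.0);
        std::vector<double> vec2(end - begin, 1.0);

        double t1 = MPI_Wtime();
        for (unsigned i = begin; i < end; i++) {
            vec1[i - begin] = vec2[i - begin] + i;
        }
        double t2 = MPI_Wtime();
        std::printf("rank %d: %f s\n", rank, t2 - t1);

        MPI_Finalize();
        return 0;
    }

With this split the total amount of memory traffic stays the same as more ranks are added, instead of growing with the number of ranks, which is a fairer way to measure scaling on a single node.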

+1
