Not sure how to explain some performance results of my matrix multiplication code

Question

Not sure how to explain some performance results of my matrix multiplication code

I run this code in OpenMP to multiply the matrix, and I measured its results:

#pragma omp for schedule(static)
for (int j = 0; j < COLUMNS; j++)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];

There are different versions of the code based on where I put the directive #pragma omp- before loop j, loop k, or loop i. In addition, for each of these options, I ran different versions for default static planning, static planning with pieces 1 and 10, and dynamic planning with the same pieces. I also measured the number of DC hits, DC passes, processor hours, retired instructions, and other performance metrics in CodeXL. Here are the results for a 1000x1000 matrix on AMD Phenom i X4 945:

Performance Results

Where multiply_matrices_1_dynamic_1is a function with #pragma ompbefore the first cycle and a dynamic schedule with piece 1, etc. Here are some things that I don’t quite understand about the results and would like to help:

The default static version with parallelization before the inner loop works at 2.512, and the serial version runs at 31,683s - which, despite the fact that it works on a 4-core computer, I assumed that the largest possible acceleration is 4x. Is this logical for this, or is it some kind of mistake? How can I explain this?
- CodeXL says that the 3rd version with static scheduling has a much smaller number of DC accesses (and omissions) than other versions. Why is this? Is this not so much because all parallel threads work in the same cell of matrix b. This is true?
- , - , R, . ? ?
- , , . , 3- ( i) ( 1000x1000), . 2- ( 2- ). ?

, TLB . DTLB? , DC - DTLB, , - TLB , DC. TLB? , TBL / DC. , TLB. ?

+1

c++ performance parallel-processing matrix-multiplication openmp

M. Zdunek 19 . '16 6:13

2

Z boson · Answer 1 · 2016-01-19T10:11:28+0000

Gilles , , - , k matrix_b[k][j].

matrix_b, matrix_bT[j][k] k, . O(n^2)) O(n^3), 1/n. .. n .

, . j :

#pragma omp for schedule(static)
for (int i = 0; i < ROWS; i++ ) {
    for (int k = 0; k < COLUMNS; k++ ) {
        for ( int j = 0; j < COLUMNS; j++ ) {
           matrix_r[i][j] += matrix_a[i][k]*matrix_b[k][j];
        }
    }
}

Gilles , , , .

Gilles · Answer 2 · 2016-01-19T07:09:04+0000

, , , , , , . , .

, , , , , , , : , . , , , .

, , ( ):

#pragma omp for schedule( static )
for ( int i = 0; i < ROWS; i++ ) {
    for ( int j = 0; j < COLUMNS; j++ ) {
        auto res = matrix_r[i][j]; // IDK the type here
        #pragma omp simd reduction( + : res )
        for ( int k = 0; k < COLUMNS; k++ ) {
           res += matrix_a[i][k] * matrix_b[k][j];
        }
        matrix_r[i][j] = res;
    }
}

(NB: simd , , )

, , / .

Not sure how to explain some performance results of my matrix multiplication code

More articles: