Can race conditions reduce code performance?

I am running the following code for matrix multiplication, the performance of which I must measure:

for (int j = 0; j < COLUMNS; j++)
#pragma omp for schedule(dynamic, 10)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];

Yes, I know this is very slow, but that is not the point: the code exists purely so I can measure its performance. I run three versions of the code depending on where I put the #pragma omp directive, i.e. which loop gets parallelized. The code is built in Microsoft Visual Studio 2012 in Release mode and profiled with CodeXL.

One thing I noticed in the measurements is that the variant shown above (parallelized over the k loop) is the slowest, followed by the version with the directive before the j loop, and then the one with it before the i loop. The variant shown above is also the one that computes a wrong result, because of race conditions: multiple threads update the same element of the result matrix simultaneously. I understand why the i-loop version is the fastest: each thread processes only its own part of the i range, which improves locality. But I do not understand what makes the k-loop version the slowest. Is it because it produces the wrong result?

+4
2

Yes, they can. In the k-loop version several threads update the same elements of the result matrix (and therefore the same cache lines) at the same time. Every such write invalidates the copies of that cache line held by the other cores, so the line constantly bounces between caches instead of staying hot in one of them.

In other words, the slowdown and the wrong result have the same root cause, unsynchronized concurrent writes to shared data, but it is not the incorrectness itself that costs time: it is the cache-coherence traffic that those conflicting writes generate. The i-loop version avoids this, because each thread writes only to its own rows of the result.
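The coherence traffic disappears once each thread accumulates into private storage and the partial results are combined afterwards. Below is a hypothetical sketch of parallelizing over k correctly, using plain std::thread instead of OpenMP and a small fixed size N (the original ROWS/COLUMNS are not given): every thread multiplies its own slice of k into a private partial matrix, and the partials are summed single-threaded at the end.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

constexpr int N = 64; // hypothetical size; stands in for ROWS == COLUMNS

using Mat = std::vector<std::vector<int>>;

Mat make(int fill) { return Mat(N, std::vector<int>(N, fill)); }

// Parallelize over k *without* the race: every thread gets a disjoint
// k-range and a private partial result, so no two threads ever write
// the same element concurrently.
Mat multiply_over_k(const Mat& a, const Mat& b, int nthreads) {
    std::vector<Mat> partial(nthreads, make(0));
    std::vector<std::thread> pool;
    const int chunk = (N + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            const int k0 = t * chunk;
            const int k1 = std::min(N, k0 + chunk);
            for (int k = k0; k < k1; ++k)
                for (int i = 0; i < N; ++i)
                    for (int j = 0; j < N; ++j)
                        partial[t][i][j] += a[i][k] * b[k][j];
        });
    for (auto& th : pool) th.join();

    // Combine the private partials; done after join, so no sharing.
    Mat r = make(0);
    for (int t = 0; t < nthreads; ++t)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                r[i][j] += partial[t][i][j];
    return r;
}

// Plain serial multiply, used as the reference result.
Mat multiply_serial(const Mat& a, const Mat& b) {
    Mat r = make(0);
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)
            for (int j = 0; j < N; ++j)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}
```

The same idea maps onto OpenMP as a reduction over per-thread buffers; the essential point is that the hot loop writes only thread-private memory.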

+3

Aside from the race, there are several things in this code that hurt performance. A few observations:

  • Mind the memory layout. In C, 2D arrays are stored row by row: mat[x][y] and mat[x][y+1] are adjacent in memory, while mat[x][y] and mat[x+1][y] are dim(mat[x]) elements apart, where x selects the row and y the column. For a statement of the form __[i][j] += __[i][k] * __[k][j]; the cache-friendly loop order is therefore i -> k -> j, so that the fastest-varying index is the last index of each access.
  • Hoist loop-invariant reads out of the innermost loop. Take the original ordering:

    for (int j = 0; j < COLUMNS; j++)
        for (int k = 0; k < COLUMNS; k++)
            for (int i = 0; i < ROWS; i++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    

Here matrix_b[k][j] does not depend on i, so it can be loaded once before the innermost loop:

    for (int j = 0; j < COLUMNS; j++)
        for (int k = 0; k < COLUMNS; k++) {
            int temp = matrix_b[k][j];
            for (int i = 0; i < ROWS; i++)
                matrix_r[i][j] += matrix_a[i][k] * temp;
        }

The write to matrix_r[i][j] deserves the same treatment: the compiler usually cannot keep matrix_r[i][j] in a register, because it cannot prove that matrix_r does not alias matrix_a or matrix_b. So in a loop like

for (int i = 0; i < ROWS; i++)
    matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];

each pass performs ROWS read-modify-write operations on matrix_r in memory. Reordering the loops and accumulating into a local variable lets each result element be written exactly once:

    for (int i = 0; i < ...; i++)
        for (int j = 0; j < ...; j++) {
            int temp = 0;
            for (int k = 0; k < ...; k++)
                temp += matrix_a[i][k] * matrix_b[k][j];
            matrix_r[i][j] = temp;
        }

This cuts the number of memory writes to the result from n^3 down to n^2.

  • Transpose matrix_b. In the version above, the access matrix_b[k][j] still strides through memory by a whole row on every step of k, which wastes the cache. If matrix_b is transposed once before the multiplication, the update

    matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];

becomes

    matrix_r[i][j] += matrix_a[i][k] * matrix_b_trans[j][k];

Now the innermost k loop walks both matrix_a and matrix_b_trans sequentially in memory:

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            int temp = 0;
            for (int k = 0; k < SAMEDIM; k++)
                temp += matrix_a[i][k] * matrix_b_trans[j][k];
            matrix_r[i][j] = temp;
        }
+1
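Putting the suggestions above together, here is a sketch with hypothetical helper names and square matrices of a made-up size DIM (the question's real dimensions are unknown): matrix_b is transposed once, each result element is accumulated in a local variable, and the result is checked against the question's original j -> k -> i ordering.

```cpp
#include <vector>

constexpr int DIM = 48; // hypothetical: ROWS, COLS and SAMEDIM all equal here

using Mat = std::vector<std::vector<int>>;

// One O(n^2) transpose buys sequential access in the O(n^3) hot loop.
Mat transpose(const Mat& m) {
    Mat t(DIM, std::vector<int>(DIM));
    for (int i = 0; i < DIM; ++i)
        for (int j = 0; j < DIM; ++j)
            t[j][i] = m[i][j];
    return t;
}

// Optimized version: i -> j -> k order, transposed b, local accumulator.
Mat multiply_fast(const Mat& a, const Mat& b) {
    const Mat bt = transpose(b);
    Mat r(DIM, std::vector<int>(DIM));
    for (int i = 0; i < DIM; ++i)
        for (int j = 0; j < DIM; ++j) {
            int temp = 0; // each r[i][j] is written to memory exactly once
            for (int k = 0; k < DIM; ++k)
                temp += a[i][k] * bt[j][k]; // both rows walked sequentially
            r[i][j] = temp;
        }
    return r;
}

// The original j -> k -> i ordering from the question, as a reference.
Mat multiply_naive(const Mat& a, const Mat& b) {
    Mat r(DIM, std::vector<int>(DIM, 0));
    for (int j = 0; j < DIM; ++j)
        for (int k = 0; k < DIM; ++k)
            for (int i = 0; i < DIM; ++i)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}
```

Both functions compute the same product; only the memory access pattern differs, which is exactly what the measurements in the question are sensitive to.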

Source: https://habr.com/ru/post/1625036/

