Optimizing OpenCL matrix multiplication code

Below is an OpenCL kernel that performs blocked (tiled) matrix multiplication for several independent matrices. selectMatrixA and selectMatrixB store multiple square matrices of the same size in row-major order.

// Matrix multiplication: C = A * B.
#define BLOCK_SIZE 20
#define MATRIX_SIZE 100 * 100
#define BLOCK_DIMX 5 // Number of blocks in the x dimension

__kernel void batchedMatrixMul(__global float *selectMatrixC,
                               __global float *selectMatrixA,
                               __global float *selectMatrixB,
                               int wA, int wB)
{
    // Block index
    int bx = get_group_id(0);
    int by = get_group_id(1);

    __global float *C = selectMatrixC + (bx / BLOCK_DIMX) * MATRIX_SIZE;
    __global float *A = selectMatrixA + (bx / BLOCK_DIMX) * MATRIX_SIZE;
    __global float *B = selectMatrixB + (bx / BLOCK_DIMX) * MATRIX_SIZE;

    int tx = get_local_id(0);
    int ty = get_local_id(1);

    float Csub = 0;

    // Identify the row and column of the C matrix to work on
    int Row = (by * BLOCK_SIZE) + ty;
    int Col = ((bx % BLOCK_DIMX) * BLOCK_SIZE) + tx;

    // Declaration of the local memory array As used to store the sub-matrix of A
    __local float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the local memory array Bs used to store the sub-matrix of B
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Loop over all the sub-matrices of A and B required to compute the block sub-matrix
    for (int m = 0; m < wA / BLOCK_SIZE; ++m)
    {
        // Load the matrices from global memory to local memory.
        // Each thread loads one element of each matrix
        As[ty][tx] = A[Row * wA + m * BLOCK_SIZE + tx];
        Bs[ty][tx] = B[(m * BLOCK_SIZE + ty) * wA + Col];

        // Synchronize to make sure the matrices are loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done before
        // loading two new sub-matrices of A and B in the next iteration
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write the block sub-matrix to device memory; each thread writes one element
    C[Row * wA + Col] = Csub;
}

This is how I run the kernel:

    localWorkSize[0] = BLOCK_SIZE;
    localWorkSize[1] = BLOCK_SIZE;

    // for a 100 x 100 matrix, MATRIX_DIMX = MATRIX_DIMY = 100
    globalWorkSize[0] = MATRIX_DIMX * NUM_MATRICES;
    globalWorkSize[1] = MATRIX_DIMY;

    cl_event event;
    errcode = clEnqueueNDRangeKernel(clCommandQueue, clKernel, 2, NULL,
                                     globalWorkSize, localWorkSize, 0, NULL, &event);

The following are some performance numbers when running on an NVIDIA GRID K520 GPU:

 1. Matrix size: 100 x 100. Number of matrices = 20,000. Time taken for multiplication = 0.262 seconds. As shown in the code, the block size was set to 20; a block size of 10 was slower. This works out to around 152 GFLOPS (the arithmetic is sketched below).
 2. Matrix size: 10,000 x 10,000. Number of matrices = 1. Time taken for multiplication = 10.6 seconds. Here too the block size was 20. Using a block size of 50 is not possible due to the size of local memory.
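For reference, the 152 GFLOPS figure in case 1 follows from the usual 2*n^3 flop count per multiplication (counting one multiply and one add for each inner-loop MAD):

    FLOPs = 2 * 100^3 * 20,000 = 4 * 10^10
    4 * 10^10 / 0.262 s ≈ 152.7 GFLOPS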

Can someone please help me understand why the code is slow in general, and why case 2 is so much slower than case 1? I am new to OpenCL and I want to learn how to optimize code based on the underlying architectural details.

+5
2 answers

The reason your first test is much faster is that there is a large difference in the amount of work each test does. In fact, the ratio is 50x.

The big-O complexity of naive square matrix multiplication is O(n^3). (See: why is the time complexity of square matrix multiplication defined as O(n^3)?) As a result, multiplying a single 10k x 10k matrix actually takes a million times more work than a single 100x100 multiplication. Your 20,000 100x100 multiplications do not come close to the huge amount of work needed to multiply the large matrices once.

Matrix multiplication is just a lot of dot products. Your algorithm merely splits the dot products into groups for manageability and does not use any special tricks to reduce the operation counts in my calculations below.

For a small matrix test:

    Total dot products:               10^4
    MADs per dot product:             10^2
    Total matrix-multiply operations: 20,000 = 2 * 10^4
    Total multiply-adds:              2 * 10^(4+2+4) = 2 * 10^10 = 20,000,000,000

20 billion.

Big matrix test:

    Total dot products:               10^8
    MADs per dot product:             10^4
    Total matrix-multiply operations: 1 (or 10^0)
    Grand total multiply-adds:        10^(8+4+0) = 10^12 = 1,000,000,000,000

1000 billion.

The 10000x10000 test actually ran faster per operation: it crunches through 50 times more operations in only about 40 times the execution time.

More on the "special tricks" here: http://en.wikipedia.org/wiki/Strassen_algorithm, although this algorithm is not considered practical for GPU computing. Even more sophisticated algorithms exist, but the brute-force approach seems to be what is used most often on graphics hardware.

Why is your kernel slow in absolute terms? There are many different optimizations you can use to speed things up. Below are just a few that you can Google and experiment with yourself; you will probably come across others that I have not mentioned here.

  • Optimize the work-group and block sizes. See OpenCL's CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE query (first sketch after this list).
  • Use the float4 data type. OpenCL includes a dot() built-in that computes a dot product for floatn data types (second sketch after this list).
  • Transpose matrix B before launching the kernel; you can use a separate kernel for the transpose (third sketch after this list).
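A minimal host-side sketch of the first point. It assumes a cl_device_id named device, which is not shown in the snippets above; everything else is standard clGetKernelWorkGroupInfo usage:

    size_t preferredMultiple = 0;
    size_t maxWorkGroupSize  = 0;

    // Work-group size multiple the device/compiler actually prefers for this kernel
    clGetKernelWorkGroupInfo(clKernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferredMultiple), &preferredMultiple, NULL);

    // Upper bound on the work-group size for this kernel on this device
    clGetKernelWorkGroupInfo(clKernel, device,
                             CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);

    // Note: a 20x20 work-group (400 work-items) is not a multiple of the typical
    // NVIDIA warp size of 32; a 16x16 tile (256 work-items) usually maps better.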
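A sketch of the second point, covering only the inner loop of the kernel. It assumes the B tile is stored transposed when loaded into local memory, i.e. Bs[tx][ty] = B[(m * BLOCK_SIZE + ty) * wA + Col], so that both operands of the dot product are contiguous in k:

    // BLOCK_SIZE (20 here) must be a multiple of 4 for this to work unchanged
    for (int k = 0; k < BLOCK_SIZE; k += 4) {
        float4 a = vload4(0, &As[ty][k]);   // four consecutive elements of the A tile row
        float4 b = vload4(0, &Bs[tx][k]);   // four consecutive elements of the B tile column
        Csub += dot(a, b);                  // OpenCL built-in dot product for floatn
    }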
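And a minimal sketch of the third point: a separate transpose kernel that goes through a local tile so that both the reads and the writes stay coalesced. The kernel name and TRANSPOSE_TILE size are illustrative, not from the original code; for the square matrices in the question, width and height are equal:

    #define TRANSPOSE_TILE 16

    __kernel void transposeMatrix(__global float *out, __global const float *in,
                                  int width, int height)
    {
        __local float tile[TRANSPOSE_TILE][TRANSPOSE_TILE];

        // Read a tile of the input with coalesced accesses
        int x = get_group_id(0) * TRANSPOSE_TILE + get_local_id(0);
        int y = get_group_id(1) * TRANSPOSE_TILE + get_local_id(1);
        if (x < width && y < height)
            tile[get_local_id(1)][get_local_id(0)] = in[y * width + x];

        barrier(CLK_LOCAL_MEM_FENCE);

        // Write the tile back transposed, again with coalesced accesses
        x = get_group_id(1) * TRANSPOSE_TILE + get_local_id(0);
        y = get_group_id(0) * TRANSPOSE_TILE + get_local_id(1);
        if (x < height && y < width)
            out[y * height + x] = tile[get_local_id(0)][get_local_id(1)];
    }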
+3

In my opinion, the reason case 2 is so much slower is that the matrix multiplication access pattern is not very cache friendly. If you need the first value of the first row and then the first value of the second row, they are stored very far apart in memory, and as the matrix size grows they end up even further apart. This results in a lot of cache misses.

I have no personal experience with matrix multiplication, but one idea is to store your data along a Z-order curve to get a more cache-friendly layout. Judging by the Wikipedia links, this is similar to what was done by Valsalam et al. 2002.
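For illustration only, a Z-order (Morton) index for a 2D coordinate can be computed by interleaving the bits of the row and column. This is a generic sketch of that indexing, not something taken from the Valsalam paper:

    // Spread the low 16 bits of x apart so a zero bit sits between each pair
    // (the classic "part by 1" bit-twiddling expansion)
    unsigned int part1By1(unsigned int x)
    {
        x &= 0x0000FFFF;
        x = (x | (x << 8)) & 0x00FF00FF;
        x = (x | (x << 4)) & 0x0F0F0F0F;
        x = (x | (x << 2)) & 0x33333333;
        x = (x | (x << 1)) & 0x55555555;
        return x;
    }

    // Morton index of element (row, col); nearby 2x2 blocks of the matrix end up
    // adjacent in memory, which is the cache-friendliness argument above
    unsigned int mortonIndex(unsigned int row, unsigned int col)
    {
        return (part1By1(row) << 1) | part1By1(col);
    }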

Another quick thing I would try, before spending a lot of time on Z-ordering, is to use private variables and get rid of the barriers. Even though this requires more loads from global memory, the compiler may be able to optimize that code much more aggressively (a sketch follows below).
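A minimal sketch of that idea, reusing the layout and defines from the original kernel: each work-item accumulates its element entirely in private registers and reads A and B straight from global memory, so there are no local tiles and no barriers. How well this performs then depends on the hardware cache:

    __kernel void batchedMatrixMulNoBarrier(__global float *selectMatrixC,
                                            __global float *selectMatrixA,
                                            __global float *selectMatrixB,
                                            int wA, int wB)
    {
        int bx     = get_group_id(0);
        int matrix = bx / BLOCK_DIMX;

        __global float *C = selectMatrixC + matrix * MATRIX_SIZE;
        __global float *A = selectMatrixA + matrix * MATRIX_SIZE;
        __global float *B = selectMatrixB + matrix * MATRIX_SIZE;

        int Row = get_group_id(1) * BLOCK_SIZE + get_local_id(1);
        int Col = (bx % BLOCK_DIMX) * BLOCK_SIZE + get_local_id(0);

        // All state lives in private memory; no __local arrays, no barrier() calls
        float Csub = 0.0f;
        for (int k = 0; k < wA; ++k)
            Csub += A[Row * wA + k] * B[k * wB + Col];

        C[Row * wB + Col] = Csub;
    }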

+1

Source: https://habr.com/ru/post/1202816/

