Should OpenCL matrix multiplication be faster?

I'm trying to learn how to write optimal OpenCL kernels for the GPU. As a starting point I took an example of matrix multiplication using square tiles in local memory. However, at best I only get ~10x speedup (~50 Gflops) compared to numpy.dot() (~5 Gflops; it uses BLAS).

I found papers reporting speedups > 200x (> 1000 Gflops), e.g. ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf. I don't know what I'm doing wrong, whether it is just my GPU (NVIDIA GTX 275), or whether it is some overhead of pyOpenCL. I also measured how long it takes to copy the result from the GPU back to RAM, and it is only ~10% of the matrix multiplication time.

    #define BLOCK_SIZE 22

    // WARNING: interchanging the i and j dimensions lowers performance >2x on my NVIDIA GTX 275 GPU
    __kernel void matrixMul(
        __global float* Cij,
        __global float* Aik,
        __global float* Bkj,
        __const int ni,
        __const int nj,
        __const int nk
    ){
        int gj = get_global_id(0);   int gi = get_global_id(1);   // global element index in C
        int bj = get_group_id(0);    int bi = get_group_id(1);    // block (work-group) index
        int tj = get_local_id(0);    int ti = get_local_id(1);    // thread index within the block

        float Csub = 0;
        __local float As[BLOCK_SIZE][BLOCK_SIZE];
        __local float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // march the BLOCK_SIZE x BLOCK_SIZE tiles along the k dimension
        for (int ok = 0; ok < nk; ok += BLOCK_SIZE) {
            As[ti][tj] = Aik[ nk*gi + ok + tj ];        // tile of A[i][k]
            Bs[ti][tj] = Bkj[ nj*(ok + ti) + gj ];      // tile of B[k][j]
            barrier(CLK_LOCAL_MEM_FENCE);
            for (int k = 0; k < BLOCK_SIZE; ++k)
                Csub += As[ti][k] * Bs[k][tj];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        Cij[ nj*gi + gj ] = Csub;
    }

NOTE: the odd-looking BLOCK_SIZE = 22 is the largest BLOCK_SIZE that fits the maximum work-group size, which is 512 on my GPU. The code requires BLOCK_SIZE^2 <= max_work_group_size, and 22 = int(sqrt(512)). I also tried BLOCK_SIZE = 16 and 8, but both were slower than 22.
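For reference, this is roughly how the limit can be queried from pyOpenCL (a minimal sketch; the context creation is illustrative and assumes the first device is the GPU):

    import math
    import pyopencl as cl

    ctx = cl.create_some_context()          # illustrative; use your existing context
    device = ctx.devices[0]
    max_wg = device.max_work_group_size     # 512 on my GTX 275
    block_size = int(math.sqrt(max_wg))     # BLOCK_SIZE^2 <= max_wg  ->  22
    print(max_wg, block_size)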

I also tried a simple matrix multiplication (without local memory), but it was even slower, about 10x slower than numpy.dot(). I copied the code from http://gpgpu-computing4.blogspot.cz/2009/10/matrix-multiplication-3-opencl.html, where they claim that even the simple version (without local memory) should run ~200x faster than the CPU. I don't see anything like that.
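For completeness, the naive version I mean is roughly the following (a sketch rather than a verbatim copy of that page; one work-item per element of C, and the kernel name is mine):

    naive_src = """
    __kernel void matrixMulNaive(
        __global float* Cij, __global const float* Aik, __global const float* Bkj,
        const int ni, const int nj, const int nk )
    {
        int gj = get_global_id(0);              // column of C
        int gi = get_global_id(1);              // row of C
        float sum = 0.0f;
        for (int k = 0; k < nk; ++k)
            sum += Aik[nk*gi + k] * Bkj[nj*k + gj];   // every load hits global memory
        Cij[nj*gi + gj] = sum;
    }
    """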

Performance as a function of N in my case:

    N =  220   numpy 3.680 [Gflops]   GPU 16.428 [Gflops]   speedUp  4.464
    N =  330   numpy 4.752 [Gflops]   GPU 29.487 [Gflops]   speedUp  6.205
    N =  440   numpy 4.914 [Gflops]   GPU 37.096 [Gflops]   speedUp  7.548
    N =  550   numpy 3.849 [Gflops]   GPU 47.019 [Gflops]   speedUp 12.217
    N =  660   numpy 5.251 [Gflops]   GPU 49.999 [Gflops]   speedUp  9.522
    N =  770   numpy 4.565 [Gflops]   GPU 48.567 [Gflops]   speedUp 10.638
    N =  880   numpy 5.452 [Gflops]   GPU 44.444 [Gflops]   speedUp  8.152
    N =  990   numpy 4.976 [Gflops]   GPU 42.187 [Gflops]   speedUp  8.478
    N = 1100   numpy 5.324 [Gflops]   GPU 83.187 [Gflops]   speedUp 15.625
    N = 1210   numpy 5.401 [Gflops]   GPU 57.147 [Gflops]   speedUp 10.581
    N = 1320   numpy 5.450 [Gflops]   GPU 48.936 [Gflops]   speedUp  8.979

NOTE: the "Gflops" number is computed as N^3 / time, and it includes the time needed to copy the result from the GPU to main memory; however, that time is only a few percent of the total, especially for N > 1000.

Perhaps more illustrative, the times in seconds:

    N =  220   numpy 0.003 [s]   GPU 0.001 [s]   load 0.001 [s]   speedUp  5.000
    N =  330   numpy 0.008 [s]   GPU 0.001 [s]   load 0.001 [s]   speedUp  7.683
    N =  440   numpy 0.017 [s]   GPU 0.002 [s]   load 0.001 [s]   speedUp  7.565
    N =  550   numpy 0.043 [s]   GPU 0.004 [s]   load 0.001 [s]   speedUp 11.957
    N =  660   numpy 0.055 [s]   GPU 0.006 [s]   load 0.002 [s]   speedUp  9.298
    N =  770   numpy 0.100 [s]   GPU 0.009 [s]   load 0.003 [s]   speedUp 10.638
    N =  880   numpy 0.125 [s]   GPU 0.010 [s]   load 0.000 [s]   speedUp 12.097
    N =  990   numpy 0.195 [s]   GPU 0.015 [s]   load 0.000 [s]   speedUp 12.581
    N = 1100   numpy 0.250 [s]   GPU 0.031 [s]   load 0.000 [s]   speedUp  8.065
    N = 1210   numpy 0.328 [s]   GPU 0.031 [s]   load 0.000 [s]   speedUp 10.581
    N = 1320   numpy 0.422 [s]   GPU 0.047 [s]   load 0.000 [s]   speedUp  8.979

I thought some further speedup might be possible using async_work_group_copy, or even read_imageui, to copy the blocks into local memory. But I don't understand why I see such a big difference when I'm using basically the same code as people who claim a 200x speedup.
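For reference, the tile loads rewritten with async_work_group_copy would look roughly like this (an untested sketch, kept as a pyOpenCL source string; the rows must be copied one by one because a tile is not contiguous in global memory):

    async_tile_load = """
    // replaces the two per-element tile loads inside the ok-loop (sketch)
    event_t ev = 0;
    for (int r = 0; r < BLOCK_SIZE; ++r) {
        ev = async_work_group_copy( &As[r][0], Aik + nk*(bi*BLOCK_SIZE + r) + ok, BLOCK_SIZE, ev );
        ev = async_work_group_copy( &Bs[r][0], Bkj + nj*(ok + r) + bj*BLOCK_SIZE, BLOCK_SIZE, ev );
    }
    wait_group_events(1, &ev);
    """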

2 answers

Without even looking at your code, let me make some comments about your tests. Forget numpy for a moment and compare the peak SP FLOPs/s and DP FLOPs/s of an Intel CPU against Nvidia and AMD GPUs.

An Intel 2600K running at 4 GHz can do 4 GHz * (8 floats per AVX op) * (2 from ILP: one add and one multiply per cycle) * (4 cores) = 256 SP GFLOPs/s. For DP it is half that: 128 DP GFLOPs/s. Haswell, due out in a few weeks, will double both. Intel's MKL library achieves over 80% efficiency in GEMM. My own GEMM code gets 70% on my i7-2700, so the 5 GFlops/s you quote for numpy is tiny and makes for an unfair comparison.

I do not know exactly what the GTX 275 is capable of, but I would guess it is much more than 50 GFLOPs/s.

In the paper you linked, they are comparing against an AMD 7970. They get 848 DP GFlops/s (90% efficiency) and 2646 SP GFlops/s (70% efficiency). That is closer to 10x the CPU's performance, not 200x!

Edit: your FLOPs calculation is wrong; the count should be 2.0 * n^3. It is still an approximation, but it is asymptotically correct. Let me explain.

Consider a three-dimensional dot product: x1*x2 + y1*y2 + z1*z2. That is 3 multiplications and 2 additions, so an n-dimensional dot product costs n multiplications and (n-1) additions. A matrix product is equivalent to n*n dot products, i.e. n*n*n multiplications and n*n*(n-1) additions. That is approximately 2.0*n^3 FLOPs, so you should double all of your Gflops/s numbers.
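In code form (a trivial sketch; t is the measured time in seconds):

    def gflops(n, t):
        """GEMM cost: n*n*n multiplications + ~n*n*n additions = 2*n^3 FLOPs."""
        return 2.0 * n**3 / t / 1e9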

Edit: You may want to measure the kernel time directly. It has been a while since I used OpenCL, but with the C++ bindings I did something like this:

    queue = cl::CommandQueue(context, devices[device],
                             CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
                             &err);
    // ... other code ... run the kernel ...
    time_end   = clevent.getProfilingInfo<CL_PROFILING_COMMAND_END>();
    time_start = clevent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
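Since you are using pyOpenCL, the equivalent there would be roughly the following (a sketch; ctx, prg, the buffers, and N are placeholders for your own objects):

    import numpy as np
    import pyopencl as cl

    queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    # ... create buffers, build the program, then launch ...
    event = prg.matrixMul(queue, (N, N), (BLOCK_SIZE, BLOCK_SIZE),
                          c_buf, a_buf, b_buf,
                          np.int32(N), np.int32(N), np.int32(N))
    event.wait()
    kernel_seconds = (event.profile.end - event.profile.start) * 1e-9   # ns -> s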

A good GPU matrix-multiplication kernel not only uses local memory, it also keeps blocks of A, B and/or C in registers (which leads to higher register usage and lower occupancy, but is much faster in the end). This is because GPUs have more registers than local memory (128-256 KB of registers vs 48 KB of local memory on NVIDIA), and registers offer as much bandwidth as the ALUs can consume.
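Here is a rough sketch of the idea (illustrative and untested, not a tuned kernel; it assumes square matrices with N divisible by TS): each work-item accumulates WPT output elements in private registers while the tiles still pass through local memory.

    register_blocked_src = """
    #define TS  16          // tile size in local memory
    #define WPT 4           // elements of C computed per work-item
    #define RTS (TS/WPT)    // reduced local size in the row dimension
    // launch with local size (TS, RTS) and global size (N, N/WPT)
    __kernel void matmulWPT(__global float* C, __global const float* A,
                            __global const float* B, const int N)
    {
        const int tj  = get_local_id(0);             // 0 .. TS-1
        const int ti  = get_local_id(1);             // 0 .. RTS-1
        const int gj  = TS*get_group_id(0) + tj;     // column of C
        const int gi0 = TS*get_group_id(1) + ti;     // first row for this work-item

        __local float As[TS][TS];
        __local float Bs[TS][TS];

        float acc[WPT];                              // lives in registers
        for (int w = 0; w < WPT; ++w) acc[w] = 0.0f;

        for (int t = 0; t < N/TS; ++t) {
            for (int w = 0; w < WPT; ++w) {          // each item loads WPT rows of each tile
                As[ti + w*RTS][tj] = A[N*(gi0 + w*RTS) + TS*t + tj];
                Bs[ti + w*RTS][tj] = B[N*(TS*t + ti + w*RTS) + gj];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
            for (int k = 0; k < TS; ++k) {
                const float b = Bs[k][tj];           // one local load reused WPT times
                for (int w = 0; w < WPT; ++w)
                    acc[w] += As[ti + w*RTS][k] * b; // accumulate in registers
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        for (int w = 0; w < WPT; ++w)
            C[N*(gi0 + w*RTS) + gj] = acc[w];
    }
    """

Keeping several outputs per work-item amortizes each local-memory load over WPT multiply-adds, which is where the extra throughput comes from.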

