I'm trying to learn how to write fast OpenCL kernels for the GPU, so I took the standard example of matrix multiplication using square tiles in local memory. However, at best I get only about a 10x speedup (~50 Gflops) compared to numpy.dot() (~5 Gflops, which uses BLAS).
I found reports where people get speedups of more than 200x (>1000 Gflops), e.g. ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf. I don't know what I'm doing wrong, whether it is just because of my GPU (NVIDIA GTX 275), or whether it is due to some pyOpenCL overhead. I also measured how long it takes to copy the result from the GPU back to RAM, and it is only ~10% of the matrix multiplication time.
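This is roughly how I time the kernel and the copy separately (a minimal sketch of my pyOpenCL host code; the names ctx, prg, d_A, d_B, d_C, N and BLOCK_SIZE are just placeholders for what my benchmark script already sets up):

import numpy as np
import pyopencl as cl

# ctx, prg (containing matrixMul) and the device buffers d_C, d_A, d_B
# are assumed to exist already; N is the (square) matrix size, BLOCK_SIZE = 22
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
C = np.empty((N, N), dtype=np.float32)

evt_kernel = prg.matrixMul(queue, (N, N), (BLOCK_SIZE, BLOCK_SIZE),
                           d_C, d_A, d_B,
                           np.int32(N), np.int32(N), np.int32(N))
evt_copy = cl.enqueue_copy(queue, C, d_C)   # device -> host
evt_copy.wait()                             # in-order queue: copy finishes after the kernel

t_kernel = (evt_kernel.profile.end - evt_kernel.profile.start) * 1e-9   # seconds
t_copy   = (evt_copy.profile.end   - evt_copy.profile.start)   * 1e-9
print(t_kernel, t_copy)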
#define BLOCK_SIZE 22

__kernel void matrixMul(
    __global float* Cij,
    __global float* Aik,
    __global float* Bkj,
    __const int ni,
    __const int nj,
    __const int nk
){
    // WARNING: interchanging the i and j dimensions lowers the performance >2x on my nVidia GTX 275 GPU
    int gj = get_global_id(0);   int gi = get_global_id(1);   // global element index
    int bj = get_group_id(0);    int bi = get_group_id(1);    // block index
    int tj = get_local_id(0);    int ti = get_local_id(1);    // thread index within the block
    int oj = bi*BLOCK_SIZE;      int oi = bj*BLOCK_SIZE;      // (not used below)
    float Csub = 0;
    __local float As[BLOCK_SIZE][BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];
    for (int ok = 0; ok < nk; ok += BLOCK_SIZE) {
        As[ti][tj] = Aik[ nk*(gi   ) + tj + ok ];   // A[i][k]
        Bs[ti][tj] = Bkj[ nj*(ti+ok) + gj      ];   // B[k][j]
        barrier(CLK_LOCAL_MEM_FENCE);               // wait until the whole tile is loaded
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ti][k] * Bs[k][tj];
        barrier(CLK_LOCAL_MEM_FENCE);               // wait before the tile is overwritten
    }
    Cij[ nj*gi + gj ] = Csub;
}
NOTE: the weird BLOCK_SIZE = 22 is the largest BLOCK_SIZE that fits the maximum work-group size, which is 512 on my GPU. In this code the condition BLOCK_SIZE^2 <= max_work_group_size must hold, and 22 = int(sqrt(512)). I also tried BLOCK_SIZE = 16 and 8, but both were slower than 22.
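For completeness, this is how I pick it (a minimal sketch; nothing here is specific to my card except the numbers in the comments):

import pyopencl as cl

ctx = cl.create_some_context()
dev = ctx.devices[0]
max_wg = dev.max_work_group_size    # 512 on my GTX 275
block = int(max_wg ** 0.5)          # int(sqrt(512)) = 22, and 22*22 = 484 <= 512
print(max_wg, block)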
I also tried a simple matrix multiplication (without using local memory), but it was even slower than numpy.dot(), roughly 10x slower. I copied the code from here: http://gpgpu-computing4.blogspot.cz/2009/10/matrix-multiplication-3-opencl.html. They claim that even the simple version (without local memory) should run ~200x faster than the CPU; I don't see anything like that.
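For reference, the simple version I tried looks roughly like this (my reconstruction, not the exact blog code; one work-item per C element, everything read straight from global memory):

naive_src = """
__kernel void matrixMulNaive( __global float* C, __global float* A, __global float* B, const int N )
{
    int j = get_global_id(0);              // column of C
    int i = get_global_id(1);              // row of C
    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[i*N + k] * B[k*N + j];    // dot product of row i of A with column j of B
    C[i*N + j] = sum;
}
"""
# built the same way as the tiled kernel: cl.Program(ctx, naive_src).build()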
Performance as a function of matrix size in my case:
N =  220    numpy 3.680 [Gflops]    GPU 16.428 [Gflops]    speedUp  4.464
N =  330    numpy 4.752 [Gflops]    GPU 29.487 [Gflops]    speedUp  6.205
N =  440    numpy 4.914 [Gflops]    GPU 37.096 [Gflops]    speedUp  7.548
N =  550    numpy 3.849 [Gflops]    GPU 47.019 [Gflops]    speedUp 12.217
N =  660    numpy 5.251 [Gflops]    GPU 49.999 [Gflops]    speedUp  9.522
N =  770    numpy 4.565 [Gflops]    GPU 48.567 [Gflops]    speedUp 10.638
N =  880    numpy 5.452 [Gflops]    GPU 44.444 [Gflops]    speedUp  8.152
N =  990    numpy 4.976 [Gflops]    GPU 42.187 [Gflops]    speedUp  8.478
N = 1100    numpy 5.324 [Gflops]    GPU 83.187 [Gflops]    speedUp 15.625
N = 1210    numpy 5.401 [Gflops]    GPU 57.147 [Gflops]    speedUp 10.581
N = 1320    numpy 5.450 [Gflops]    GPU 48.936 [Gflops]    speedUp  8.979
NOTE: the "Gflops" number is obtained as N^3 / time, and it includes the time needed to copy the result from the GPU back to main memory; this copy time is only a few percent of the total, especially for N > 1000.
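In code the bookkeeping is just this (I count N^3 operations, i.e. one multiply-add per inner-loop iteration; papers that count 2*N^3 flops would report twice these numbers):

def gflops(N, t_total):
    # t_total is the wall time in seconds, including the copy back to host memory
    return N**3 / t_total / 1e9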
Perhaps the times in seconds are more illustrative:
N =  220    numpy 0.003 [s]    GPU 0.001 [s]    load 0.001 [s]    speedUp  5.000
N =  330    numpy 0.008 [s]    GPU 0.001 [s]    load 0.001 [s]    speedUp  7.683
N =  440    numpy 0.017 [s]    GPU 0.002 [s]    load 0.001 [s]    speedUp  7.565
N =  550    numpy 0.043 [s]    GPU 0.004 [s]    load 0.001 [s]    speedUp 11.957
N =  660    numpy 0.055 [s]    GPU 0.006 [s]    load 0.002 [s]    speedUp  9.298
N =  770    numpy 0.100 [s]    GPU 0.009 [s]    load 0.003 [s]    speedUp 10.638
N =  880    numpy 0.125 [s]    GPU 0.010 [s]    load 0.000 [s]    speedUp 12.097
N =  990    numpy 0.195 [s]    GPU 0.015 [s]    load 0.000 [s]    speedUp 12.581
N = 1100    numpy 0.250 [s]    GPU 0.031 [s]    load 0.000 [s]    speedUp  8.065
N = 1210    numpy 0.328 [s]    GPU 0.031 [s]    load 0.000 [s]    speedUp 10.581
N = 1320    numpy 0.422 [s]    GPU 0.047 [s]    load 0.000 [s]    speedUp  8.979
I thought that some further speedup could perhaps be gained with async_work_group_copy, or even read_imageui, for copying the blocks into local memory (a sketch of what I mean is below). But I don't understand why I see such a big difference when I'm using basically the same code as people who report 200x speedups?
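This is the kind of change I have in mind for the tile loads (a completely untested sketch, same indexing conventions as my kernel above; one async copy per tile row, since the rows of a tile are contiguous in global memory):

async_src = """
#define BLOCK_SIZE 22
__kernel void matrixMulAsync( __global float* Cij, __global float* Aik, __global float* Bkj,
                              const int ni, const int nj, const int nk )
{
    int gj = get_global_id(0);            int gi = get_global_id(1);
    int tj = get_local_id(0);             int ti = get_local_id(1);
    int oj = get_group_id(0)*BLOCK_SIZE;  // first column of this block of C
    int oi = get_group_id(1)*BLOCK_SIZE;  // first row of this block of C
    __local float As[BLOCK_SIZE][BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];
    float Csub = 0.0f;
    for (int ok = 0; ok < nk; ok += BLOCK_SIZE) {
        event_t ev = 0;
        for (int r = 0; r < BLOCK_SIZE; ++r) {           // one copy per tile row
            ev = async_work_group_copy( &As[r][0], Aik + nk*(oi + r) + ok, BLOCK_SIZE, ev );
            ev = async_work_group_copy( &Bs[r][0], Bkj + nj*(ok + r) + oj, BLOCK_SIZE, ev );
        }
        wait_group_events(1, &ev);                       // tiles are complete after this
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ti][k] * Bs[k][tj];
        barrier(CLK_LOCAL_MEM_FENCE);                    // don't overwrite the tiles too early
    }
    Cij[ nj*gi + gj ] = Csub;
}
"""
# prg_async = cl.Program(ctx, async_src).build()   # ctx: the same pyOpenCL context as above

Whether this would actually be faster on the GTX 275 I don't know; it mainly replaces the per-work-item addressing, the amount of global memory traffic stays the same.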