CUDA: OpenCL CPU version 4 times faster than the OpenCL or CUDA GPU version

The wave simulator I've been working on in C# + Cudafy (a C# → CUDA or OpenCL translator) works fine, except that the OpenCL CPU version (Intel driver, 15" MacBook Pro Retina, i7 2.7 GHz, GeForce 650M (Kepler, 384 cores)) is roughly four times faster than the GPU version.

(This happens whether I use the OpenCL GPU or the CUDA GPU target. The OpenCL GPU and CUDA versions perform almost identically.)

To clarify, for a sample problem:

  • OpenCL CPU: 1200 Hz
  • OpenCL GPU: 320 Hz
  • CUDA GPU: ~330 Hz

I'm at a loss to explain why the CPU version would be faster than the GPU version. In this case, the kernel code that executes (in the CL case) on the CPU and on the GPU is identical. During initialization I select either the CPU or the GPU device, but apart from that, everything is identical.
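
For reference, the initialization amounts to choosing a Cudafy device, along the lines of the sketch below (this uses the standard Cudafy.NET host API; the target, language, and device index shown here are illustrative, not my exact values):

    using Cudafy;
    using Cudafy.Host;
    using Cudafy.Translator;

    // Pick the backend: eGPUType.OpenCL covers both the Intel CPU driver and the
    // GeForce 650M; eGPUType.Cuda targets the 650M directly.
    CudafyModes.Target = eGPUType.OpenCL;
    CudafyTranslator.Language = eLanguage.OpenCL;

    // Translate the [Cudafy]-marked methods and load them onto the chosen device.
    CudafyModule km = CudafyTranslator.Cudafy();

    // The device index selects the CPU or the GPU OpenCL device; everything after
    // this point is identical for both runs.
    GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
    gpu.LoadModule(km);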

Edit

Here's the C# code that launches one of the kernels. (The others are very similar.)

    public override void UpdateEz(Source source, float Time, float ca, float cb)
    {
        var blockSize = new dim3(1);
        var gridSize = new dim3(_gpuEz.Field.GetLength(0), _gpuEz.Field.GetLength(1));

        Gpu.Launch(gridSize, blockSize)
           .CudaUpdateEz(
               Time,
               ca,
               cb,
               source.Position.X,
               source.Position.Y,
               source.Value,
               _gpuHx.Field,
               _gpuHy.Field,
               _gpuEz.Field);
    }

And here is the corresponding CUDA kernel function generated by Cudafy:

 extern "C" __global__ void CudaUpdateEz(float time, float ca, float cb, int sourceX, int sourceY, float sourceValue, float* hx, int hxLen0, int hxLen1, float* hy, int hyLen0, int hyLen1, float* ez, int ezLen0, int ezLen1) { int x = blockIdx.x; int y = blockIdx.y; if (x > 0 && x < ezLen0 - 1 && y > 0 && y < ezLen1 - 1) { ez[(x) * ezLen1 + ( y)] = ca * ez[(x) * ezLen1 + ( y)] + cb * (hy[(x) * hyLen1 + ( y)] - hy[(x - 1) * hyLen1 + ( y)]) - cb * (hx[(x) * hxLen1 + ( y)] - hx[(x) * hxLen1 + ( y - 1)]); } if (x == sourceX && y == sourceY) { ez[(x) * ezLen1 + ( y)] += sourceValue; } } 

Just for completeness, here is the C# that is used to generate the CUDA:

    [Cudafy]
    public static void CudaUpdateEz(
        GThread thread,
        float time,
        float ca,
        float cb,
        int sourceX,
        int sourceY,
        float sourceValue,
        float[,] hx,
        float[,] hy,
        float[,] ez)
    {
        var i = thread.blockIdx.x;
        var j = thread.blockIdx.y;

        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
            ez[i, j] = ca * ez[i, j]
                     + cb * (hy[i, j] - hy[i - 1, j])
                     - cb * (hx[i, j] - hx[i, j - 1]);

        if (i == sourceX && j == sourceY)
            ez[i, j] += sourceValue;
    }

Obviously, the if in this kernel is bad, but even the resulting pipeline stall shouldn't cause such an extreme performance delta.

The only other thing that jumps out at me is that I'm using a naive grid/block allocation scheme: the grid represents the size of the array being updated, and each block is a single thread. I'm sure that has some impact on performance, but I can't see it making the GPU run at 1/4 of the speed of the CL code on the CPU. ARGH!

1 answer

Answering this to get it off the unanswered list.

The code posted indicates a kernel launch that specifies a threadblock of 1 (active) thread. This is not the way to write fast GPU code, as it leaves most of the GPU's capability idle.

Typical threadblock sizes should be at least 128 threads per block, and more is often better, in multiples of 32, up to a limit of 512 or 1024 per block, depending on the GPU.
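
For illustration only (a sketch based on the code in the question, not the asker's actual fix), the launch in UpdateEz could use a 16 x 16 block, i.e. 256 threads, and shrink the grid accordingly:

    public override void UpdateEz(Source source, float Time, float ca, float cb)
    {
        int width  = _gpuEz.Field.GetLength(0);
        int height = _gpuEz.Field.GetLength(1);

        // 16 x 16 = 256 threads per block: a multiple of 32, with enough work
        // per block to keep the SMs busy.
        var blockSize = new dim3(16, 16);

        // One block now covers a 16 x 16 tile, so round the grid up to cover
        // the whole array.
        var gridSize = new dim3((width + 15) / 16, (height + 15) / 16);

        Gpu.Launch(gridSize, blockSize)
           .CudaUpdateEz(Time, ca, cb,
                         source.Position.X, source.Position.Y, source.Value,
                         _gpuHx.Field, _gpuHy.Field, _gpuEz.Field);
    }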

The GPU "loves" to hide latency, having a lot of parallel work "available." Specifying more threads per block helps with this. (Having a fairly large number of threads in the grid can also help.)

In addition, the GPU executes threads in groups of 32 (warps). Specifying only 1 thread per block, or a non-multiple of 32, will leave some idle execution slots in every threadblock that executes. 1 thread per block is particularly bad.
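
The kernel then has to derive its cell from both the block and thread indices, and guard against the extra threads created by the rounded-up grid. Again, a sketch in the question's Cudafy C# form, not a drop-in replacement:

    [Cudafy]
    public static void CudaUpdateEz(GThread thread, float time, float ca, float cb,
        int sourceX, int sourceY, float sourceValue,
        float[,] hx, float[,] hy, float[,] ez)
    {
        // Global cell index = block offset + thread offset within the block.
        var i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        var j = thread.blockIdx.y * thread.blockDim.y + thread.threadIdx.y;

        // Threads that fall outside the array (from the rounded-up grid) do nothing.
        if (i >= ez.GetLength(0) || j >= ez.GetLength(1))
            return;

        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
            ez[i, j] = ca * ez[i, j]
                     + cb * (hy[i, j] - hy[i - 1, j])
                     - cb * (hx[i, j] - hx[i, j - 1]);

        if (i == sourceX && j == sourceY)
            ez[i, j] += sourceValue;
    }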


Source: https://habr.com/ru/post/944488/

