The wave simulator I've been working on with C# + Cudafy (a C# → CUDA or OpenCL translator) works fine, except that the OpenCL CPU version (Intel driver, 15" MacBook Pro Retina i7 2.7 GHz, GeForce 650M (Kepler, 384 cores)) is roughly four times as fast as the GPU version.
(This happens whether I use the OpenCL or the CUDA GPU backend. The OpenCL GPU and CUDA versions perform nearly identically.)
To clarify, for a sample problem:
- OpenCL CPU: 1200 Hz
- OpenCL GPU: 320 Hz
- CUDA GPU: ~330 Hz
I'm at a loss to explain why the CPU version would be faster than the GPU version. In this case, the kernel code that executes (in the CL case) on the CPU and on the GPU is identical. I select either the CPU or the GPU device during initialization, but apart from that, everything is the same.
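I haven't included my initialization code, but it boils down to something like the sketch below. The SelectDevice helper and the device index are illustrative assumptions, not my actual code; the point is that only the target/device choice differs between the CPU and GPU runs:

using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public static GPGPU SelectDevice(bool useCpu)
{
    // eGPUType.OpenCL covers both the Intel CPU driver and the GeForce via OpenCL;
    // eGPUType.Cuda targets the GeForce 650M directly.
    var target = useCpu ? eGPUType.OpenCL : eGPUType.Cuda;
    var deviceId = 0; // assumed index; the Intel CPU device may sit at a different index

    CudafyTranslator.Language = (target == eGPUType.OpenCL) ? eLanguage.OpenCL : eLanguage.Cuda;

    var gpu = CudafyHost.GetDevice(target, deviceId);
    gpu.LoadModule(CudafyTranslator.Cudafy()); // translate and load the [Cudafy] kernels
    return gpu;
}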
Edit
Here's the C# code that launches one of the kernels. (The others are very similar.)
public override void UpdateEz(Source source, float Time, float ca, float cb)
{
    var blockSize = new dim3(1);
    var gridSize = new dim3(_gpuEz.Field.GetLength(0), _gpuEz.Field.GetLength(1));

    Gpu.Launch(gridSize, blockSize)
       .CudaUpdateEz(Time, ca, cb,
                     source.Position.X, source.Position.Y, source.Value,
                     _gpuHx.Field, _gpuHy.Field, _gpuEz.Field);
}
And here's the corresponding CUDA kernel function generated by Cudafy:
extern "C" __global__ void CudaUpdateEz(float time, float ca, float cb, int sourceX, int sourceY, float sourceValue, float* hx, int hxLen0, int hxLen1, float* hy, int hyLen0, int hyLen1, float* ez, int ezLen0, int ezLen1) { int x = blockIdx.x; int y = blockIdx.y; if (x > 0 && x < ezLen0 - 1 && y > 0 && y < ezLen1 - 1) { ez[(x) * ezLen1 + ( y)] = ca * ez[(x) * ezLen1 + ( y)] + cb * (hy[(x) * hyLen1 + ( y)] - hy[(x - 1) * hyLen1 + ( y)]) - cb * (hx[(x) * hxLen1 + ( y)] - hx[(x) * hxLen1 + ( y - 1)]); } if (x == sourceX && y == sourceY) { ez[(x) * ezLen1 + ( y)] += sourceValue; } }
Just for completeness, here's the C# that's used to generate the CUDA:
[Cudafy]
public static void CudaUpdateEz(GThread thread, float time, float ca, float cb,
                                int sourceX, int sourceY, float sourceValue,
                                float[,] hx, float[,] hy, float[,] ez)
{
    var i = thread.blockIdx.x;
    var j = thread.blockIdx.y;

    if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
        ez[i, j] = ca * ez[i, j]
                 + cb * (hy[i, j] - hy[i - 1, j])
                 - cb * (hx[i, j] - hx[i, j - 1]);

    if (i == sourceX && j == sourceY)
        ez[i, j] += sourceValue;
}
Obviously, the if statement in this kernel is bad, but even the resulting pipeline stall shouldn't cause such an extreme performance delta.
The only other thing that jumps out at me is that I'm using a naive grid/block allocation scheme -- i.e., the grid is the size of the array to be updated, and each block is a single thread. I'm sure this has some impact on performance, but I can't see it making the code 1/4 the speed of the CL code running on the CPU. ARGH!
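For reference, a more conventional launch would use a multi-thread block and compute each cell's index from both blockIdx and threadIdx. The sketch below mirrors the UpdateEz/CudaUpdateEz pair above; the 16x16 block size and the "Blocked" names are arbitrary examples I haven't actually profiled:

public void UpdateEzBlocked(Source source, float time, float ca, float cb)
{
    const int tile = 16; // arbitrary example block edge, not tuned

    var sizeX = _gpuEz.Field.GetLength(0);
    var sizeY = _gpuEz.Field.GetLength(1);

    // One 16x16 block per tile of the field, rounding up to cover the edges.
    var blockSize = new dim3(tile, tile);
    var gridSize = new dim3((sizeX + tile - 1) / tile, (sizeY + tile - 1) / tile);

    Gpu.Launch(gridSize, blockSize)
       .CudaUpdateEzBlocked(time, ca, cb,
                            source.Position.X, source.Position.Y, source.Value,
                            _gpuHx.Field, _gpuHy.Field, _gpuEz.Field);
}

[Cudafy]
public static void CudaUpdateEzBlocked(GThread thread, float time, float ca, float cb,
                                       int sourceX, int sourceY, float sourceValue,
                                       float[,] hx, float[,] hy, float[,] ez)
{
    // Still one thread per cell, but now many threads per block.
    var i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
    var j = thread.blockIdx.y * thread.blockDim.y + thread.threadIdx.y;

    if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
        ez[i, j] = ca * ez[i, j]
                 + cb * (hy[i, j] - hy[i - 1, j])
                 - cb * (hx[i, j] - hx[i, j - 1]);

    if (i == sourceX && j == sourceY)
        ez[i, j] += sourceValue;
}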