CUDA Device To Device transfer expensive

Question

I wrote some code to swap the quadrants of a 2D matrix for FFT purposes. The matrix is stored in a flat array.

    int leftover = W - dcW;

    T *temp;
    T *topHalf;
    cudaMalloc((void **)&temp, dcW * sizeof(T));

    //swap every row, left and right
    for (int i = 0; i < H; i++)
    {
        cudaMemcpy(temp, &data[i*W], dcW*sizeof(T), cudaMemcpyDeviceToDevice);
        cudaMemcpy(&data[i*W], &data[i*W+dcW], leftover*sizeof(T), cudaMemcpyDeviceToDevice);
        cudaMemcpy(&data[i*W+leftover], temp, dcW*sizeof(T), cudaMemcpyDeviceToDevice);
    }

    cudaMalloc((void **)&topHalf, dcH*W*sizeof(T));
    leftover = H - dcH;
    cudaMemcpy(topHalf, data, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(data, &data[dcH*W], leftover*W*sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(&data[leftover*W], topHalf, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);

Note that this code works on device pointers and performs DeviceToDevice transfers.

Why does it run so slowly? Can it be optimized somehow? I timed it against the same operation on the host using regular memcpy, and it was about 2 times slower.

Any ideas?

+4

c++ fft cuda

Derek May 19 '11 at 19:02


2 answers

Perhaps the following solution for performing a 2D fftshift in CUDA will be of interest:

    #define IDX2R(i,j,N) (((i)*(N))+(j))

    __global__ void fftshift_2D(double2 *data, int N1, int N2)
    {
        int i = threadIdx.y + blockDim.y * blockIdx.y;
        int j = threadIdx.x + blockDim.x * blockIdx.x;

        if (i < N1 && j < N2) {
            double a = pow(-1.0, (i+j)&1);

            data[IDX2R(i,j,N2)].x *= a;
            data[IDX2R(i,j,N2)].y *= a;
        }
    }

It consists of multiplying the matrix to be transformed by a chessboard of 1s and -1s, which is equivalent to multiplying by exp(-j*(n+m)*pi) and therefore to a shift in both directions in the conjugate domain.
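The equivalence follows from the DFT shift theorem. As a sketch, assume that both dimensions $N_1$ and $N_2$ are even and that the forward transform uses the usual negative-exponent convention. The symbols $x$, $X$, $k$ and $l$ are introduced here for illustration only:

```latex
\begin{aligned}
(-1)^{n+m} &= e^{-j\pi(n+m)}, \\
\sum_{n=0}^{N_1-1}\sum_{m=0}^{N_2-1} x[n,m]\,e^{-j\pi(n+m)}\,
  e^{-j2\pi\left(\frac{kn}{N_1}+\frac{lm}{N_2}\right)}
 &= \sum_{n=0}^{N_1-1}\sum_{m=0}^{N_2-1} x[n,m]\,
  e^{-j2\pi\left(\frac{(k+N_1/2)\,n}{N_1}+\frac{(l+N_2/2)\,m}{N_2}\right)} \\
 &= X\!\left[\left(k+\tfrac{N_1}{2}\right) \bmod N_1,\ \left(l+\tfrac{N_2}{2}\right) \bmod N_2\right].
\end{aligned}
```

In other words, modulating the input by the chessboard moves the zero-frequency bin to index $(N_1/2, N_2/2)$ of the output. For an odd dimension the half-sample offset is not an integer, so the identity above does not apply as written.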

You have to call this kernel before and after applying the CUFFT.

One advantage is that memory movements/swaps are avoided.

SPEED IMPROVEMENT

Following a suggestion received on the NVIDIA forum, the speed can be improved by changing the instruction

 double a = pow(-1.0,(i+j)&1);

to

 double a = 1-2*((i+j)&1);

to avoid calling the slow pow routine.
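A quick way to sanity-check both the integer sign expression and the chessboard idea is a small host-side test. The following is only a sketch in plain C++ (no GPU involved). The names `chessboard_sign`, `dft2` and `max_shift_error` are made up for this illustration, the DFT is a naive reference implementation, and both dimensions are assumed to be even.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Chessboard sign as in the faster variant: +1 when i+j is even, -1 when odd.
inline double chessboard_sign(int i, int j) { return 1 - 2 * ((i + j) & 1); }

// Naive O((N1*N2)^2) forward 2D DFT of a row-major N1 x N2 matrix.
// Reference only; a real application would call an FFT library instead.
std::vector<cd> dft2(const std::vector<cd>& x, int N1, int N2) {
    assert(x.size() == static_cast<std::size_t>(N1) * N2);
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cd> X(x.size());
    for (int k = 0; k < N1; ++k)
        for (int l = 0; l < N2; ++l) {
            cd sum(0.0, 0.0);
            for (int i = 0; i < N1; ++i)
                for (int j = 0; j < N2; ++j) {
                    double phase = -two_pi * (double(k) * i / N1 + double(l) * j / N2);
                    sum += x[i * N2 + j] * std::polar(1.0, phase);
                }
            X[k * N2 + l] = sum;
        }
    return X;
}

// Largest deviation between the DFT of the sign-modulated input and the
// DFT of the original input circularly shifted by (N1/2, N2/2).
// Assumes N1 and N2 are even.
double max_shift_error(int N1, int N2) {
    std::vector<cd> x(static_cast<std::size_t>(N1) * N2), xm(x.size());
    for (int i = 0; i < N1; ++i)
        for (int j = 0; j < N2; ++j) {
            // Arbitrary deterministic test data.
            x[i * N2 + j] = cd(std::sin(0.7 * i + 0.3 * j), std::cos(1.1 * i - 0.5 * j));
            xm[i * N2 + j] = x[i * N2 + j] * chessboard_sign(i, j);
        }
    const std::vector<cd> X = dft2(x, N1, N2);
    const std::vector<cd> Xm = dft2(xm, N1, N2);
    double err = 0.0;
    for (int k = 0; k < N1; ++k)
        for (int l = 0; l < N2; ++l) {
            const int ks = (k + N1 / 2) % N1;
            const int ls = (l + N2 / 2) % N2;
            err = std::max(err, std::abs(Xm[k * N2 + l] - X[ks * N2 + ls]));
        }
    return err;
}
```

For a small even-sized matrix, for example `max_shift_error(4, 6)`, the returned error should be at rounding level. `chessboard_sign(i, j)` should also agree with `pow(-1.0, (i+j)&1)` for either parity of `i+j`.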

+3

Jackolantern Jan 7 '13 at 20:17


Derek · Accepted Answer · 2011-05-27T14:40:58+0000

In the end, I wrote a kernel to do the swaps. It was indeed faster than the Device to Device memcpy operations.
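The kernel itself was not posted. As a hedged sketch of the per-element index mapping such a kernel could use, here is a plain C++ host reference that mirrors the memcpy sequence from the question: each row is rotated left by `dcW`, then the rows are rotated up by `dcH`. The helper names `shifted_source_index` and `shift_quadrants` are hypothetical, and the double loop stands in for a 2D thread grid in which each thread would handle one `(i, j)`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): for output element (i, j),
// return the flat index of the input element that lands there when every row
// is rotated left by dcW and the row order is rotated up by dcH.
inline std::size_t shifted_source_index(int i, int j, int H, int W, int dcH, int dcW) {
    const int src_i = (i + dcH) % H;
    const int src_j = (j + dcW) % W;
    return static_cast<std::size_t>(src_i) * W + src_j;
}

// Out-of-place shift of a row-major H x W matrix. On a GPU, the loop body
// would be the work of one thread, with (i, j) derived from block and
// thread indices and guarded by a bounds check.
template <typename T>
std::vector<T> shift_quadrants(const std::vector<T>& src, int H, int W, int dcH, int dcW) {
    assert(src.size() == static_cast<std::size_t>(H) * W);
    std::vector<T> dst(src.size());
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j)
            dst[static_cast<std::size_t>(i) * W + j] =
                src[shifted_source_index(i, j, H, W, dcH, dcW)];
    return dst;
}
```

The sketch writes to a separate output buffer on purpose. If every element is handled by its own thread, an in-place version would have threads reading locations that other threads are overwriting. A single pass like this would also replace the roughly 3*H + 3 separate `cudaMemcpy` calls of the original code with one launch.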
