Typing in CUDA and cuBLAS

I am writing a program in CUDA and I am trying to reduce data transfer overhead. I use the cuBLAS library to multiply matrices, and I need to send 30,000,000 numbers whose values range from 0 to 255.

Right now I am sending them as floats, since I want my final product to be floats, which ends up being quite expensive given that the values would fit in bytes.

Is there a way to send them as bytes and have them arrive as floats when using cuBLAS or any other fast math library? Or to tell the GPU to treat them as floats in some way?

1 answer

You can use cudaMemcpy to copy an unsigned char array from the host to the device, and allocate a float array on the device with cudaMalloc. Then write a simple custom kernel that converts the byte array into the float array:

 __global__ void byteToFloat(float *out, const unsigned char *in, int n)
 {
     // Grid-stride loop: each thread converts elements i, i + stride, ...
     for (int i = threadIdx.x + blockIdx.x * blockDim.x;
          i < n;
          i += gridDim.x * blockDim.x)
         out[i] = in[i];  // implicit unsigned char -> float widening
 }
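For context, here is a minimal host-side sketch of how the pieces fit together (error checking omitted; the wrapper name uploadAsFloats and the launch configuration are illustrative assumptions, not part of the original answer). The point is that only n bytes cross the bus instead of n * sizeof(float):

 #include <cuda_runtime.h>

 // Hypothetical wrapper: upload n bytes, expand them to floats on the device.
 void uploadAsFloats(const unsigned char *hostBytes, float **dFloatsOut, int n)
 {
     unsigned char *dBytes;
     float *dFloats;

     cudaMalloc(&dBytes, n * sizeof(unsigned char));
     cudaMalloc(&dFloats, n * sizeof(float));

     // Only n bytes travel over the bus, instead of n * sizeof(float).
     cudaMemcpy(dBytes, hostBytes, n, cudaMemcpyHostToDevice);

     // Expand bytes to floats on the device.
     int threads = 256;
     int blocks = (n + threads - 1) / threads;
     byteToFloat<<<blocks, threads>>>(dFloats, dBytes, n);

     cudaDeviceSynchronize();   // make sure the kernel is done with dBytes
     cudaFree(dBytes);          // staging buffer no longer needed
     *dFloatsOut = dFloats;     // ready to pass to cuBLAS (e.g. cublasSgemm)
 }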

If your host data is already stored as floats, this may be slower than just copying the floats directly. Try it and see. But if your array is already of type unsigned char, you need this conversion anyway, so the approach above will probably be efficient.

Note that for best performance you can probably also overlap the copies with computation where possible (but this is beyond the scope of the question: see the CUDA Best Practices Guide and the Programming Guide for information on cudaMemcpyAsync).
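As a rough sketch of what that overlap could look like (purely illustrative and not from the original answer: it assumes hostBytes was allocated with cudaHostAlloc, since cudaMemcpyAsync only overlaps with pinned host memory, and it reuses the byteToFloat kernel above):

 #include <algorithm>
 #include <cuda_runtime.h>

 // Hypothetical: upload and convert in chunks on two streams, so the
 // copy of one chunk overlaps with the conversion of the previous one.
 void uploadAndConvertOverlapped(const unsigned char *hostBytes,
                                 unsigned char *dBytes, float *dFloats,
                                 int n, int chunkSize)
 {
     cudaStream_t streams[2];
     cudaStreamCreate(&streams[0]);
     cudaStreamCreate(&streams[1]);

     for (int offset = 0; offset < n; offset += chunkSize) {
         cudaStream_t s = streams[(offset / chunkSize) % 2];
         int count = std::min(chunkSize, n - offset);

         cudaMemcpyAsync(dBytes + offset, hostBytes + offset,
                         count, cudaMemcpyHostToDevice, s);
         byteToFloat<<<(count + 255) / 256, 256, 0, s>>>(
             dFloats + offset, dBytes + offset, count);
     }

     cudaDeviceSynchronize();  // wait for all chunks before using dFloats
     cudaStreamDestroy(streams[0]);
     cudaStreamDestroy(streams[1]);
 }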


Source: https://habr.com/ru/post/1394924/
