You can allocate both the unsigned char array and the float array on the device with cudaMalloc, copy the byte data from host to device with cudaMemcpy, and then write a small kernel that converts the byte array into the float array:
```cuda
__global__ void byteToFloat(float *out, unsigned char *in, int n)
{
    // Grid-stride loop: each thread handles multiple elements.
    for (int i = threadIdx.x + blockIdx.x * blockDim.x;
         i < n;
         i += gridDim.x * blockDim.x)
        out[i] = in[i];
}
```
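A host-side sketch of how this fits together, assuming the byteToFloat kernel above is defined in the same file; the array size n and the launch configuration are arbitrary choices for illustration, and error checking is omitted:

```cuda
#include <cuda_runtime.h>

// Conversion kernel as defined above.
__global__ void byteToFloat(float *out, unsigned char *in, int n);

int main()
{
    const int n = 1 << 20;                       // example size (assumption)
    unsigned char *h_in = new unsigned char[n];  // host byte data
    // ... fill h_in ...

    unsigned char *d_in;
    float *d_out;
    cudaMalloc(&d_in,  n * sizeof(unsigned char));
    cudaMalloc(&d_out, n * sizeof(float));

    // Copy the compact byte array, then expand to float on the device.
    cudaMemcpy(d_in, h_in, n * sizeof(unsigned char), cudaMemcpyHostToDevice);
    byteToFloat<<<256, 256>>>(d_out, d_in, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    delete[] h_in;
    return 0;
}
```

Note that the host-to-device transfer moves only one byte per element rather than four, which is the main reason this approach can beat copying floats directly.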
If your host data is already stored as float, this may be slower than simply copying the floats; try both and measure. But if your array is of type unsigned char to begin with, you have to do this conversion somewhere anyway, so doing it on the device as above will probably be efficient.
Note that for best performance you can probably try to overlap the copy with computation where possible (this is beyond the scope of the question: see the CUDA Best Practices Guide and Programming Guide for information on cudaMemcpyAsync).
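To illustrate the overlap idea, here is a minimal sketch that splits the conversion across two streams so the copy of one chunk can overlap with the kernel on the other. The two-stream split, the launch configuration, and the helper name convertAsync are assumptions for illustration (not from the original answer), and n is assumed even:

```cuda
#include <cuda_runtime.h>

__global__ void byteToFloat(float *out, unsigned char *in, int n);  // as above

void convertAsync(const unsigned char *h_in, unsigned char *d_in,
                  float *d_out, int n)
{
    const int chunk = n / 2;  // split the work across two streams
    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k)
        cudaStreamCreate(&s[k]);

    for (int k = 0; k < 2; ++k) {
        int off = k * chunk;
        // cudaMemcpyAsync only truly overlaps when h_in is page-locked
        // (allocated with cudaHostAlloc or registered with cudaHostRegister).
        cudaMemcpyAsync(d_in + off, h_in + off, chunk,
                        cudaMemcpyHostToDevice, s[k]);
        // Kernel in the same stream waits for its chunk's copy only,
        // so it can run while the other stream's copy is in flight.
        byteToFloat<<<256, 256, 0, s[k]>>>(d_out + off, d_in + off, chunk);
    }

    for (int k = 0; k < 2; ++k) {
        cudaStreamSynchronize(s[k]);
        cudaStreamDestroy(s[k]);
    }
}
```

Whether this helps depends on the transfer-to-compute ratio; for a kernel this light the copy dominates, so the gain comes mainly from overlapping it with other work in your application.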