If you can split your function so that it works on chunks of data on the card, you should look into using streams (cudaStream_t).
If you schedule the loads and kernel launches in multiple streams, one stream can transfer data while another runs a kernel on the card, thereby hiding some of the data transfer time behind kernel execution.
You need to allocate a buffer whose size is your chunk size times the number of streams you declare (up to 16 for compute capability 1.x, as far as I know).
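As a rough sketch of that idea, not a definitive implementation: the kernel `scale`, the stream count `kStreams`, and the chunk size `kChunk` below are all made up for illustration, and error checking is omitted for brevity. The host buffer is pinned with cudaMallocHost, because cudaMemcpyAsync can only overlap with kernel execution when the host memory is page-locked.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-element kernel standing in for "your function".
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 4;        // assumed number of streams
    const int kChunk = 1 << 20;    // assumed number of elements per chunk
    const int kTotal = kStreams * kChunk;
    const size_t chunkBytes = kChunk * sizeof(float);

    float *h_data = nullptr;
    float *d_data = nullptr;
    // Pinned host memory, required for truly asynchronous copies.
    cudaMallocHost(&h_data, kTotal * sizeof(float));
    // Device buffer: chunk size times number of streams.
    cudaMalloc(&d_data, kTotal * sizeof(float));
    for (int i = 0; i < kTotal; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, runs the kernel, and copies it back.
    // Work queued in different streams may overlap on the device.
    for (int s = 0; s < kStreams; ++s) {
        float *h = h_data + s * kChunk;
        float *d = d_data + s * kChunk;
        cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(kChunk + 255) / 256, 256, 0, streams[s]>>>(d, kChunk, 2.0f);
        cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish

    // Verify that every element was processed.
    bool ok = true;
    for (int i = 0; i < kTotal; ++i) ok = ok && (h_data[i] == 2.0f);
    printf("%s\n", ok ? "ok" : "mismatch");

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return ok ? 0 : 1;
}
```

If you have more chunks than streams, you can loop over the chunks and reuse each stream's slice of the buffer; operations queued in the same stream execute in order, so a slice is not overwritten before its previous chunk has been copied back.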