Efficiently transfer a large file (up to 2 GB) to a CUDA GPU?

Question

I am working on a GPU-accelerated program that has to read an entire file of variable size. My question is: what is the optimal number of bytes to read from the file at a time and transfer to the coprocessor (a CUDA device)?

These files can be as large as 2 GiB, so creating a buffer of that size does not seem like a good idea.

+6

io large-files cuda file-transfer bandwidth

sj755 Mar 16 '12 at 3:02


2 answers

If you can split your function up so that it works on chunks on the card, you should look into using streams (cudaStream_t).

If you schedule the loads and kernel launches in several streams, one stream can load data while another runs a kernel on the card, which hides some of the data transfer time behind kernel execution.

You need to declare a buffer whose size is your chunk size multiplied by the number of streams you declare (up to 16 for compute capability 1.x, as far as I know).
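To make the overlap concrete, here is a minimal sketch with two streams, each owning one staging buffer and one device buffer. The kernel `addOne`, the helper `processInChunks`, and the choice of two streams are assumptions for illustration, not part of the original answer. Error checking on the CUDA calls is left out for brevity. The staging buffers are page-locked (`cudaMallocHost`) because `cudaMemcpyAsync` needs page-locked host memory to overlap with kernel execution.

```cuda
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cuda_runtime.h>

// Placeholder kernel (an assumption, stands in for the real work): adds 1 to each element
__global__ void addOne( int* data, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )
        data[i] += 1;
}

// Hypothetical helper: processes `total` ints from `src` into `dst`, `chunk` ints at a time,
// alternating between two streams so one chunk's copies can overlap the other chunk's kernel
void processInChunks( const int* src, int* dst, int total, int chunk )
{
    assert( chunk > 0 );

    const int kStreams = 2;
    cudaStream_t stream[kStreams];
    int* pinned[kStreams];             // page-locked host staging buffers
    int* dev[kStreams];                // one device buffer per stream
    int offset[kStreams] = { 0, 0 };   // where each slot's pending chunk starts in dst
    int count[kStreams] = { 0, 0 };    // how many ints each slot's pending chunk holds

    for( int s = 0; s < kStreams; ++s )
    {
        cudaStreamCreate( &stream[s] );
        cudaMallocHost( &pinned[s], chunk * sizeof( int ) );
        cudaMalloc( &dev[s], chunk * sizeof( int ) );
    }

    for( int done = 0, k = 0; done < total; ++k )
    {
        int s = k % kStreams;

        // Wait for the chunk previously queued on this stream and save its result
        cudaStreamSynchronize( stream[s] );
        if( count[s] > 0 )
            std::memcpy( dst + offset[s], pinned[s], count[s] * sizeof( int ) );

        // Stage the next chunk; the other stream keeps working in the meantime
        offset[s] = done;
        count[s] = ( total - done < chunk ) ? total - done : chunk;
        std::size_t bytes = count[s] * sizeof( int );
        std::memcpy( pinned[s], src + done, bytes );

        // Upload, process, and download, all queued on the same stream
        cudaMemcpyAsync( dev[s], pinned[s], bytes, cudaMemcpyHostToDevice, stream[s] );
        addOne<<< ( count[s] + 255 ) / 256, 256, 0, stream[s] >>>( dev[s], count[s] );
        cudaMemcpyAsync( pinned[s], dev[s], bytes, cudaMemcpyDeviceToHost, stream[s] );

        done += count[s];
    }

    // Drain both streams, collect the last results, and release everything
    for( int s = 0; s < kStreams; ++s )
    {
        cudaStreamSynchronize( stream[s] );
        if( count[s] > 0 )
            std::memcpy( dst + offset[s], pinned[s], count[s] * sizeof( int ) );
        cudaFree( dev[s] );
        cudaFreeHost( pinned[s] );
        cudaStreamDestroy( stream[s] );
    }
}
```

When the input comes from a file, the `std::memcpy` into `pinned[s]` could be replaced by reading the next chunk of the file straight into that buffer, so the whole file never has to sit in host memory at once.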

0

P O'Conbhui Mar 27 '12 at 1:49


Ashwin nanjappa · Accepted Answer · 2012-03-16T03:07:05+0000

You can cudaMalloc the largest buffer that fits on your device. After that, copy chunks of your input data of that size from the host to the device, process each one, copy back the results, and continue.

    // Your input data on the host
    int hostBufNum = 5600000;
    int* hostBuf = ...;

    // Assume this is the largest device buffer you can allocate
    int devBufNum = 1000000;
    int* devBuf;
    cudaMalloc( &devBuf, sizeof( int ) * devBufNum );

    int* hostChunk = hostBuf;
    int hostLeft = hostBufNum;

    do
    {
        // Recompute the chunk size on every pass so that the last,
        // shorter chunk does not read past the end of the host buffer
        int chunkNum = ( hostLeft < devBufNum ) ? hostLeft : devBufNum;

        cudaMemcpy( devBuf, hostChunk, chunkNum * sizeof( int ), cudaMemcpyHostToDevice );
        doSomethingKernel<<< >>>( devBuf, chunkNum );  // fill in your grid and block dimensions

        hostChunk = hostChunk + chunkNum;
        hostLeft = hostBufNum - ( hostChunk - hostBuf );
    } while( hostLeft > 0 );
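The loop above assumes the whole input already sits in hostBuf. For a file of up to 2 GiB, the same chunking can be applied to the read itself, so the host only ever holds one chunk. Below is a minimal host-side sketch of that idea. The helper name `forEachChunk` and its callback are hypothetical. In a CUDA program the callback is where the cudaMemcpy and the kernel launch would go.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

// Hypothetical helper (not from the answer): reads `path` in pieces of at most
// `chunkBytes` bytes and hands each piece to `consume`.
// Returns the total number of bytes read, or -1 if the file cannot be opened.
long long forEachChunk( const char* path, std::size_t chunkBytes,
                        const std::function<void( const char*, std::size_t )>& consume )
{
    assert( chunkBytes > 0 );

    std::FILE* file = std::fopen( path, "rb" );
    if( !file )
        return -1;

    std::vector<char> hostChunk( chunkBytes );  // one reusable host buffer
    long long total = 0;
    std::size_t got;

    // fread returns a short count only at end of file or on a read error,
    // so every chunk except possibly the last one is full
    while( ( got = std::fread( hostChunk.data(), 1, chunkBytes, file ) ) > 0 )
    {
        consume( hostChunk.data(), got );
        total += static_cast<long long>( got );
    }

    std::fclose( file );
    return total;
}
```

A call such as `forEachChunk( "input.bin", devBufNum * sizeof( int ), ... )` (the file name is only an example) would then feed the device one buffer-sized piece at a time.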
