I am sure this has been answered before, but I cannot find a good explanation.
I am writing a graphics program in which part of the pipeline copies voxel data to page-locked (pinned) OpenCL memory. I found that this copy is a bottleneck, so I measured the performance of a simple std::copy. The data is floats, and each block of data that I want to copy is about 64 MB in size.
This is my original code before any benchmarking attempts:
std::copy(data, data+numVoxels, pinnedPointer_[_index]);
Here data is a float pointer, numVoxels is an unsigned int, and pinnedPointer_[_index] is a float pointer that refers to a pinned OpenCL buffer.
Since performance was slow, I decided to try copying smaller parts of the data and see what bandwidth I get. I used boost::timer::cpu_timer for timing. I ran it several times, and also averaged over a couple of hundred runs, getting similar results. Here is the relevant code along with the results:
    boost::timer::cpu_timer t;
    unsigned int testNum = numVoxels;
    while (testNum > 2) {
        t.start();
        std::copy(data, data+testNum, pinnedPointer_[_index]);
        t.stop();
        boost::timer::cpu_times result = t.elapsed();
        double time = (double)result.wall / 1.0e9;   // wall time is reported in nanoseconds
        int size = testNum*sizeof(float);
        double GB = (double)size / 1073741824.0;     // bytes to GB (2^30 bytes)
        // Print results
        testNum /= 2;
    }

    Copied 67108864 bytes in 0.032683s, 1.912315 GB/s
    Copied 33554432 bytes in 0.017193s, 1.817568 GB/s
    Copied 16777216 bytes in 0.008586s, 1.819749 GB/s
    Copied 8388608 bytes in 0.004227s, 1.848218 GB/s
    Copied 4194304 bytes in 0.001886s, 2.071705 GB/s
    Copied 2097152 bytes in 0.000819s, 2.383543 GB/s
    Copied 1048576 bytes in 0.000290s, 3.366923 GB/s
    Copied 524288 bytes in 0.000063s, 7.776913 GB/s
    Copied 262144 bytes in 0.000016s, 15.741867 GB/s
    Copied 131072 bytes in 0.000008s, 15.213149 GB/s
    Copied 65536 bytes in 0.000004s, 14.374742 GB/s
    Copied 32768 bytes in 0.000003s, 10.209962 GB/s
    Copied 16384 bytes in 0.000001s, 10.344942 GB/s
    Copied 8192 bytes in 0.000001s, 6.476566 GB/s
    Copied 4096 bytes in 0.000001s, 4.999603 GB/s
    Copied 2048 bytes in 0.000001s, 1.592111 GB/s
    Copied 1024 bytes in 0.000001s, 1.600125 GB/s
    Copied 512 bytes in 0.000001s, 0.843960 GB/s
    Copied 256 bytes in 0.000001s, 0.210990 GB/s
    Copied 128 bytes in 0.000001s, 0.098439 GB/s
    Copied 64 bytes in 0.000001s, 0.049795 GB/s
    Copied 32 bytes in 0.000001s, 0.049837 GB/s
    Copied 16 bytes in 0.000001s, 0.023728 GB/s
There is a clear peak in throughput when copying blocks of 65536 to 262144 bytes, and the throughput there is much higher than when copying the full array (about 15 versus 2 GB/s).
Knowing this, I decided to try one more thing: I copied the full array, but with repeated calls to std::copy, where each call processed only part of the array. Trying different block sizes, these are my results:
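For reference, the GB/s column is just bytes divided by 2^30 and then by the elapsed wall time. A minimal sketch of that arithmetic follows; the helper name gibPerSecond is made up for illustration and is not part of the original benchmark.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not in the original code): throughput in GB/s,
// using 1 GB = 2^30 bytes, the same divisor as the benchmark.
double gibPerSecond(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);  // a zero or negative interval has no meaningful rate
    const double gib = static_cast<double>(bytes) / 1073741824.0;
    return gib / seconds;
}
```

Calling gibPerSecond(67108864, 0.032683) gives roughly 1.91, which matches the first row. Note that the times are printed with microsecond precision, so the rows showing 0.000001s say little about the real rate at those sizes.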
    unsigned int testNum = numVoxels;
    unsigned int parts = 1;
    while (sizeof(float)*testNum > 256) {
        t.start();
        for (unsigned int i=0; i<parts; ++i) {
            std::copy(data+i*testNum, data+(i+1)*testNum, pinnedPointer_[_index]+i*testNum);
        }
        t.stop();
        boost::timer::cpu_times result = t.elapsed();
        double time = (double)result.wall / 1.0e9;
        int size = testNum*sizeof(float);
        double GB = parts*(double)size / 1073741824.0;
        // Print results
        parts *= 2;
        testNum /= 2;
    }

    Part size 67108864 bytes, copied 0.0625 GB in 0.0331298s, 1.88652 GB/s
    Part size 33554432 bytes, copied 0.0625 GB in 0.0339876s, 1.83891 GB/s
    Part size 16777216 bytes, copied 0.0625 GB in 0.0342558s, 1.82451 GB/s
    Part size 8388608 bytes, copied 0.0625 GB in 0.0334264s, 1.86978 GB/s
    Part size 4194304 bytes, copied 0.0625 GB in 0.0287896s, 2.17092 GB/s
    Part size 2097152 bytes, copied 0.0625 GB in 0.0289941s, 2.15561 GB/s
    Part size 1048576 bytes, copied 0.0625 GB in 0.0240215s, 2.60184 GB/s
    Part size 524288 bytes, copied 0.0625 GB in 0.0184499s, 3.38756 GB/s
    Part size 262144 bytes, copied 0.0625 GB in 0.0186002s, 3.36018 GB/s
    Part size 131072 bytes, copied 0.0625 GB in 0.0185958s, 3.36097 GB/s
    Part size 65536 bytes, copied 0.0625 GB in 0.0185735s, 3.365 GB/s
    Part size 32768 bytes, copied 0.0625 GB in 0.0186523s, 3.35079 GB/s
    Part size 16384 bytes, copied 0.0625 GB in 0.0187756s, 3.32879 GB/s
    Part size 8192 bytes, copied 0.0625 GB in 0.0182212s, 3.43007 GB/s
    Part size 4096 bytes, copied 0.0625 GB in 0.01825s, 3.42465 GB/s
    Part size 2048 bytes, copied 0.0625 GB in 0.0181881s, 3.43631 GB/s
    Part size 1024 bytes, copied 0.0625 GB in 0.0180842s, 3.45605 GB/s
    Part size 512 bytes, copied 0.0625 GB in 0.0186669s, 3.34817 GB/s
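Reducing the chunk size clearly has a significant effect (about 1.9 GB/s for the largest parts versus about 3.4 GB/s for parts of 512 KB and smaller), but I still can't get anywhere near 15 GB/s.
I am running 64-bit Ubuntu, and GCC optimization settings do not make much of a difference.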
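The chunked copy above only works as written when the part size divides the array evenly, which holds here because everything is a power of two. A minimal sketch of the same idea for arbitrary sizes is below; copyInChunks and chunkElems are names I made up for illustration, and I have only thought about it for ordinary host memory, not the pinned buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not in the original code): copies n floats from src
// to dst in blocks of at most chunkElems elements. The final block is
// shorter when n is not a multiple of chunkElems.
void copyInChunks(const float* src, std::size_t n, float* dst,
                  std::size_t chunkElems) {
    assert(chunkElems > 0);  // a zero-sized chunk would never advance
    for (std::size_t offset = 0; offset < n; offset += chunkElems) {
        const std::size_t len = std::min(chunkElems, n - offset);
        std::copy(src + offset, src + offset + len, dst + offset);
    }
}
```

In my case the call would look like copyInChunks(data, numVoxels, pinnedPointer_[_index], 131072), where 131072 floats is the 524288-byte part size at which the numbers above level off.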
- Why does array size affect bandwidth this way?
- Does it make a difference that the destination is pinned OpenCL memory?
- What are the strategies for optimizing the copy of a large array?