Array size and copy performance

I'm sure this has been answered before, but I can't find a good explanation.

I am writing a graphics program in which part of the pipeline copies voxel data into page-locked (pinned) OpenCL memory. I found that this copy is a bottleneck, so I made some performance measurements of a plain std::copy . The data is floats, and each chunk of data I want to copy is about 64 MB.

This is my original code before any benchmarking attempts:

 std::copy(data, data+numVoxels, pinnedPointer_[_index]); 

Where data is a pointer to float, numVoxels is an unsigned int, and pinnedPointer_[_index] is a float pointer referring to a pinned OpenCL buffer.

Since performance was slow, I decided to try copying smaller parts of the data to see what bandwidth I get. I used boost::timer::cpu_timer for timing. I tried both running it for a while and averaging over a couple of hundred runs, with similar results. Here is the relevant code along with the results:

 boost::timer::cpu_timer t;
 unsigned int testNum = numVoxels;
 while (testNum > 2) {
     t.start();
     std::copy(data, data+testNum, pinnedPointer_[_index]);
     t.stop();
     boost::timer::cpu_times result = t.elapsed();
     double time = (double)result.wall / 1.0e9;
     int size = testNum*sizeof(float);
     double GB = (double)size / 1073741824.0;
     // Print results
     testNum /= 2;
 }

 Copied 67108864 bytes in 0.032683s, 1.912315 GB/s
 Copied 33554432 bytes in 0.017193s, 1.817568 GB/s
 Copied 16777216 bytes in 0.008586s, 1.819749 GB/s
 Copied 8388608 bytes in 0.004227s, 1.848218 GB/s
 Copied 4194304 bytes in 0.001886s, 2.071705 GB/s
 Copied 2097152 bytes in 0.000819s, 2.383543 GB/s
 Copied 1048576 bytes in 0.000290s, 3.366923 GB/s
 Copied 524288 bytes in 0.000063s, 7.776913 GB/s
 Copied 262144 bytes in 0.000016s, 15.741867 GB/s
 Copied 131072 bytes in 0.000008s, 15.213149 GB/s
 Copied 65536 bytes in 0.000004s, 14.374742 GB/s
 Copied 32768 bytes in 0.000003s, 10.209962 GB/s
 Copied 16384 bytes in 0.000001s, 10.344942 GB/s
 Copied 8192 bytes in 0.000001s, 6.476566 GB/s
 Copied 4096 bytes in 0.000001s, 4.999603 GB/s
 Copied 2048 bytes in 0.000001s, 1.592111 GB/s
 Copied 1024 bytes in 0.000001s, 1.600125 GB/s
 Copied 512 bytes in 0.000001s, 0.843960 GB/s
 Copied 256 bytes in 0.000001s, 0.210990 GB/s
 Copied 128 bytes in 0.000001s, 0.098439 GB/s
 Copied 64 bytes in 0.000001s, 0.049795 GB/s
 Copied 32 bytes in 0.000001s, 0.049837 GB/s
 Copied 16 bytes in 0.000001s, 0.023728 GB/s

There is a clear throughput peak when copying blocks of 65536-262144 bytes, much higher than when copying the full array (15 vs. 2 GB/s).
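As an aside, the sweep can be reproduced without Boost. Here is a minimal, self-contained sketch using std::chrono (the function names and the 16M-element size are my stand-ins for the real setup, not the original code):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// One timed std::copy of n floats; returns bandwidth in GiB/s.
// steady_clock replaces boost::timer::cpu_timer here; it measures
// wall time at roughly nanosecond resolution on typical systems.
double bench_copy(unsigned n, const float* src, float* dst) {
    auto t0 = std::chrono::steady_clock::now();
    std::copy(src, src + n, dst);
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double gib = double(n) * sizeof(float) / 1073741824.0;
    return gib / seconds;
}

// Halving-size sweep, mirroring the loop in the question.
void run_benchmark(unsigned numVoxels) {
    std::vector<float> src(numVoxels, 1.0f), dst(numVoxels);
    for (unsigned n = numVoxels; n > 2; n /= 2)
        std::printf("Copied %zu bytes, %f GB/s\n",
                    std::size_t(n) * sizeof(float),
                    bench_copy(n, src.data(), dst.data()));
}
```

For sizes this small the per-call timer overhead dominates, so averaging over many runs (as done above) is still essential.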

Knowing this, I decided to try one more thing: copying the full array using repeated calls to std::copy , each call handling only part of the array. These are my results for different block sizes:

 unsigned int testNum = numVoxels;
 unsigned int parts = 1;
 while (sizeof(float)*testNum > 256) {
     t.start();
     for (unsigned int i=0; i<parts; ++i) {
         std::copy(data+i*testNum, data+(i+1)*testNum, pinnedPointer_[_index]+i*testNum);
     }
     t.stop();
     boost::timer::cpu_times result = t.elapsed();
     double time = (double)result.wall / 1.0e9;
     int size = testNum*sizeof(float);
     double GB = parts*(double)size / 1073741824.0;
     // Print results
     parts *= 2;
     testNum /= 2;
 }

 Part size 67108864 bytes, copied 0.0625 GB in 0.0331298s, 1.88652 GB/s
 Part size 33554432 bytes, copied 0.0625 GB in 0.0339876s, 1.83891 GB/s
 Part size 16777216 bytes, copied 0.0625 GB in 0.0342558s, 1.82451 GB/s
 Part size 8388608 bytes, copied 0.0625 GB in 0.0334264s, 1.86978 GB/s
 Part size 4194304 bytes, copied 0.0625 GB in 0.0287896s, 2.17092 GB/s
 Part size 2097152 bytes, copied 0.0625 GB in 0.0289941s, 2.15561 GB/s
 Part size 1048576 bytes, copied 0.0625 GB in 0.0240215s, 2.60184 GB/s
 Part size 524288 bytes, copied 0.0625 GB in 0.0184499s, 3.38756 GB/s
 Part size 262144 bytes, copied 0.0625 GB in 0.0186002s, 3.36018 GB/s
 Part size 131072 bytes, copied 0.0625 GB in 0.0185958s, 3.36097 GB/s
 Part size 65536 bytes, copied 0.0625 GB in 0.0185735s, 3.365 GB/s
 Part size 32768 bytes, copied 0.0625 GB in 0.0186523s, 3.35079 GB/s
 Part size 16384 bytes, copied 0.0625 GB in 0.0187756s, 3.32879 GB/s
 Part size 8192 bytes, copied 0.0625 GB in 0.0182212s, 3.43007 GB/s
 Part size 4096 bytes, copied 0.0625 GB in 0.01825s, 3.42465 GB/s
 Part size 2048 bytes, copied 0.0625 GB in 0.0181881s, 3.43631 GB/s
 Part size 1024 bytes, copied 0.0625 GB in 0.0180842s, 3.45605 GB/s
 Part size 512 bytes, copied 0.0625 GB in 0.0186669s, 3.34817 GB/s

It seems that reducing the block size does have a significant effect, but I still can't get anywhere near 15 GB/s.

I am running 64-bit Ubuntu; GCC optimization flags don't make a big difference.

  • Why does the array size affect bandwidth this way?
  • Does the fact that I'm copying to pinned OpenCL memory matter?
  • What strategies are there for optimizing a large array copy?
1 answer

I'm fairly sure you are running into cache effects. When you fill the cache with the data you have written, then the next time some data is needed, the cache has to read it from memory; but FIRST it needs to make room in the cache, and because much of the cached data is "dirty" (it was just written), it must be written back to RAM. Then the new bit of data is written into the cache, which in turn evicts another bit of dirty data (or something we read earlier).

In assembler, we can get around this with "non-temporal" move instructions, for example the SSE instruction movntps . This instruction avoids storing the data in the cache.
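In C++ the same instruction is reachable through compiler intrinsics, so no assembler is needed. A minimal sketch (the function name is mine; it assumes an x86 target with SSE and a 16-byte-aligned destination):

```cpp
#include <immintrin.h>  // SSE intrinsics (x86 only)
#include <cstddef>

// Copy floats using non-temporal stores (_mm_stream_ps compiles to
// movntps), so the written data bypasses the cache instead of
// dirtying it. Assumes dst is 16-byte aligned.
void copy_nontemporal(const float* src, float* dst, std::size_t count) {
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);  // ordinary (cached) load
        _mm_stream_ps(dst + i, v);         // non-temporal store
    }
    for (; i < count; ++i)                 // scalar tail for the remainder
        dst[i] = src[i];
    _mm_sfence();  // make the streaming stores visible to other readers
}
```

Non-temporal stores pay off mainly for buffers larger than the cache; for small copies they can be slower than regular stores.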

Edit: you can also improve performance by not mixing reads and writes: use a small fixed-size buffer of, say, 4-16 KB, copy the data into that buffer, then write the buffer to its destination. Again, ideally use non-temporal writes, as that improves throughput here too; but even just reading a block and then writing a block, rather than read one, write one, will be much faster.

Something like this:

 float temp[2048];
 int left_to_do = numVoxels;
 int offset = 0;
 while (left_to_do) {
     int block = std::min(left_to_do, (int)(sizeof(temp)/sizeof(temp[0])));
     std::copy(data+offset, data+offset+block, temp);
     std::copy(temp, temp+block, pinnedPointer_[_index]+offset);
     offset += block;
     left_to_do -= block;
 }

Try this and see if it improves the situation. It may not ...

Edit2: I should explain why this is faster: you reuse the same small piece of cache to stage the data each time, and by not mixing reads and writes we get the best performance out of the memory itself.
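Combining the two edits, here is a sketch of the staging-buffer copy with a non-temporal final write (again assuming x86/SSE and a 16-byte-aligned destination; buffered_nt_copy is my name, not part of the original code):

```cpp
#include <immintrin.h>  // SSE intrinsics (x86 only)
#include <algorithm>
#include <cstddef>

// Staging-buffer copy with non-temporal stores: reads go through a
// small cache-resident buffer, and the final write to the destination
// bypasses the cache entirely. Assumes dst is 16-byte aligned.
void buffered_nt_copy(const float* src, float* dst, std::size_t count) {
    alignas(16) float temp[2048];  // ~8 KB staging buffer, stays in cache
    std::size_t done = 0;
    while (done < count) {
        std::size_t block = std::min<std::size_t>(count - done, 2048);
        std::copy(src + done, src + done + block, temp);  // read phase
        std::size_t i = 0;
        for (; i + 4 <= block; i += 4)  // write phase, bypassing the cache
            _mm_stream_ps(dst + done + i, _mm_load_ps(temp + i));
        for (; i < block; ++i)          // scalar tail
            dst[done + i] = temp[i];
        done += block;
    }
    _mm_sfence();  // flush the write-combining buffers
}
```

Separating the read and write phases per block is what avoids the dirty-line eviction traffic described above.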

