The overhead is not the call itself, but the compilation of the GPU program and the transfer of data between the GPU and the host. The CPU is highly optimized for functions that can be performed entirely in cache, and DDR3 memory latency is far lower than that of the PCI-Express bus serving the GPU. I experienced this myself when writing GPU FFT routines (pre-CUDA). See this related question.

Update: the following results are from a hand-written GPU FFT algorithm from 2005 (nVidia 7800 GTX), but they demonstrate the principle of the CPU-GPU bottleneck.
| N | FFTW (s) | GPUFFT (s) | GPUFFT MFLOPS | GPUFFT speedup |
| --- | --- | --- | --- | --- |
| 8 | 0 | 0.00006 | 3.352705 | 0.006881 |
| 16 | 0.000001 | 0.000065 | 7.882117 | 0.010217 |
| 32 | 0.000001 | 0.000075 | 17.10887 | 0.014695 |
| 64 | 0.000002 | 0.000085 | 36.080118 | 0.026744 |
| 128 | 0.000004 | 0.000093 | 76.724324 | 0.040122 |
| 256 | 0.000007 | 0.000107 | 153.739856 | 0.066754 |
| 512 | 0.000015 | 0.000115 | 320.200892 | 0.134614 |
| 1024 | 0.000034 | 0.000125 | 657.735381 | 0.270512 |
| 2048 | 0.000076 | 0.000156 | 1155.151507 | 0.484331 |
| 4096 | 0.000173 | 0.000215 | 1834.212989 | 0.804558 |
| 8192 | 0.000483 | 0.00032 | 2664.042421 | 1.510011 |
| 16384 | 0.001363 | 0.000605 | 3035.4551 | 2.255411 |
| 32768 | 0.003168 | 0.00114 | 3450.455808 | 2.780041 |
| 65536 | 0.008694 | 0.002464 | 3404.628083 | 3.528726 |
| 131072 | 0.015363 | 0.005027 | 3545.850483 | 3.05604 |
| 262144 | 0.033223 | 0.012513 | 3016.885246 | 2.655183 |
| 524288 | 0.072918 | 0.025879 | 3079.443664 | 2.817667 |
| 1048576 | 0.173043 | 0.076537 | 2192.056517 | 2.260904 |
| 2097152 | 0.331553 | 0.157427 | 2238.01491 | 2.106081 |
| 4194304 | 0.801544 | 0.430518 | 1715.573229 | 1.861814 |
The table above shows timings of the GPU FFT implementation against the CPU implementation as a function of kernel size. For smaller sizes, data transfer to/from the GPU dominates. Smaller kernels can be executed on the CPU, some implementations/sizes entirely in cache; this makes the CPU the best choice for small operations.
If, on the other hand, you need to perform large batches of work on the data with minimal movement to/from the GPU, the GPU will beat the CPU hands down.
As for measuring the effect in your example, I would suggest running an experiment like the one above. Try to work out the FLOPS computed for each matrix size, and run the test on both the CPU and the GPU for varying matrix sizes. Output the size, time, and FLOPS for GPU and CPU to a CSV file. For any profiling, make sure you run several hundred iterations of your code, time the whole run, and then divide the total time by the number of iterations to get the per-iteration time. Try differently shaped matrices as well, if your algorithm allows it (e.g. 10x100 rather than 100x10). A sketch of such a harness is shown below.
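For illustration, here is a minimal Python sketch of the CPU side of such a benchmark, using NumPy's FFT and one common flop-count convention for a complex FFT (5·N·log2(N), the one FFTW's benchmarks use). The file name and iteration count are arbitrary choices, and the GPU column would be produced the same way with whichever GPU library you are using.

```python
# CPU-side benchmark sketch: per-size timing and MFLOPS written to a CSV file.
import csv
import math
import time

import numpy as np

ITERATIONS = 500  # several hundred iterations, as suggested above (arbitrary choice)

with open("fft_timings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["N", "cpu_seconds", "cpu_mflops"])

    for exponent in range(3, 23):  # N = 8 .. 4194304, same range as the table
        n = 1 << exponent
        data = (np.random.rand(n) + 1j * np.random.rand(n)).astype(np.complex64)

        start = time.perf_counter()
        for _ in range(ITERATIONS):
            np.fft.fft(data)
        total = time.perf_counter() - start

        per_iteration = total / ITERATIONS  # divide total time by iterations
        mflops = 5 * n * math.log2(n) / per_iteration / 1e6
        writer.writerow([n, per_iteration, mflops])
```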
Using this data, you can get a feel for what the overheads are. To pin them down exactly, run the same experiment, but replace the inner shader code that runs on the GPU with a no-op (simply copy the input to the output).
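A rough sketch of that no-op measurement, assuming a CUDA GPU and the CuPy library (not part of the original 2005 setup): timing a round trip that only copies the data to the device and back isolates the transfer overhead, which you can then compare against the same round trip with a real FFT in the middle.

```python
# Transfer-only vs transfer-plus-compute round trips, assuming CuPy is available.
import time

import cupy as cp
import numpy as np

def round_trip(host_data, compute):
    """Copy to the device, optionally run an FFT, copy back; return elapsed seconds."""
    start = time.perf_counter()
    device_data = cp.asarray(host_data)        # host -> device transfer
    if compute:
        device_data = cp.fft.fft(device_data)  # the real kernel
    result = cp.asnumpy(device_data)           # device -> host transfer
    cp.cuda.Device().synchronize()             # ensure all GPU work has finished
    return time.perf_counter() - start

n = 1 << 20
data = (np.random.rand(n) + 1j * np.random.rand(n)).astype(np.complex64)

transfer_only = min(round_trip(data, compute=False) for _ in range(100))
with_fft = min(round_trip(data, compute=True) for _ in range(100))
print(f"transfer only: {transfer_only:.6f} s, transfer + FFT: {with_fft:.6f} s")
```

The transfer-only figure is the fixed cost every GPU call pays, and the gap between the two timings approximates the kernel time alone.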
Hope this helps,