Overhead for calling OpenCL or CUDA?

I am writing a function that performs many BLAS gemv operations.

I would like to be able to do this on the GPU, and I tried cuBLAS.

My problem is that my matrices and vectors are quite small: the matrix is 100x100 and there are 100 vectors. cuBLAS takes ages compared to the CPU, and I see why: a mix of fast cache on the CPU and the large overhead of making calls to the GPU.

So I am trying to find a sensible way of measuring the time it takes just to dispatch a call to the GPU.

That is, the time it takes CUDA to set up the call and send it to the GPU - not counting the time the matrix-vector multiplication itself actually takes.

How can I do it?

+4
4 answers

Update: The results below are for a hand-written GPU FFT algorithm from 2005 (nVidia 7800 GTX), but they demonstrate the principle of the CPU-GPU transfer bottleneck.

The overhead is not the call itself but compilation of the GPU program and transferring the data between the GPU and the host. The CPU is highly optimized for functions that can be performed entirely in cache, and the latency of DDR3 memory is far lower than that of the PCI-Express bus which services the GPU. I experienced this myself when writing GPU FFT routines (prior to CUDA). See this related question.

 N        FFTw (s)   GPUFFT (s)  GPUFFT MFLOPS  GPUFFT Speedup
 8        0          0.00006     3.352705       0.006881
 16       0.000001   0.000065    7.882117       0.010217
 32       0.000001   0.000075    17.10887       0.014695
 64       0.000002   0.000085    36.080118      0.026744
 128      0.000004   0.000093    76.724324      0.040122
 256      0.000007   0.000107    153.739856     0.066754
 512      0.000015   0.000115    320.200892     0.134614
 1024     0.000034   0.000125    657.735381     0.270512
 2048     0.000076   0.000156    1155.151507    0.484331
 4096     0.000173   0.000215    1834.212989    0.804558
 8192     0.000483   0.00032     2664.042421    1.510011
 16384    0.001363   0.000605    3035.4551      2.255411
 32768    0.003168   0.00114     3450.455808    2.780041
 65536    0.008694   0.002464    3404.628083    3.528726
 131072   0.015363   0.005027    3545.850483    3.05604
 262144   0.033223   0.012513    3016.885246    2.655183
 524288   0.072918   0.025879    3079.443664    2.817667
 1048576  0.173043   0.076537    2192.056517    2.260904
 2097152  0.331553   0.157427    2238.01491     2.106081
 4194304  0.801544   0.430518    1715.573229    1.861814

The table above shows timings of a GPU FFT implementation and a CPU implementation as a function of kernel size. For smaller sizes, the data transfer to/from the GPU dominates. Smaller kernels can be executed on the CPU, with some implementations/sizes fitting entirely in cache. This makes the CPU the best choice for small operations.

If, on the other hand, you need to perform large batches of work on data with minimal movement to/from the GPU, then the GPU will beat the CPU hands down.

As for measuring the effect in your example, I would suggest performing an experiment like the one above. Try to work out the FLOPS computed for each matrix size, and run the test on the CPU and the GPU for varying matrix sizes. Output to a CSV file the size, time and FLOPS for the GPU and the CPU. For any profiling, make sure you run several hundred iterations of your code, time the whole thing, and then divide the total time by the number of iterations to get the time per loop. Try different matrix shapes as well, if your algorithm allows it (e.g. 10x100 rather than 100x10). A rough sketch of such a benchmark is shown below.
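The following is a minimal sketch of that experiment, not the answer's original code: it assumes single precision, uses a naive CPU gemv as the reference, uploads the data to the GPU once outside the timed loop, and prints one CSV line of size, time and MFLOPS for CPU and GPU. Helper names such as now_ms and cpu_gemv are my own; compile with nvcc and link against cuBLAS (-lcublas).

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

static double now_ms(void) {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

// Naive CPU reference: y = A * x, A is n x n in column-major order
static void cpu_gemv(int n, const float *A, const float *x, float *y) {
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += A[j * n + i] * x[j];
        y[i] = sum;
    }
}

int main(void) {
    const int n = 100, iters = 1000;
    const double flop = 2.0 * n * n;      // 2 flops (multiply + add) per matrix element

    float *A = (float*)malloc(n * n * sizeof(float));
    float *x = (float*)malloc(n * sizeof(float));
    float *y = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n * n; ++i) A[i] = 1.0f;
    for (int i = 0; i < n; ++i)     x[i] = 1.0f;

    // CPU timing: run many iterations and divide by the iteration count
    double t0 = now_ms();
    for (int i = 0; i < iters; ++i) cpu_gemv(n, A, x, y);
    double cpu_ms = (now_ms() - t0) / iters;

    // GPU timing: upload once, then time only the gemv calls
    cublasHandle_t handle;
    cublasCreate(&handle);
    float *dA, *dx, *dy;
    cudaMalloc((void**)&dA, n * n * sizeof(float));
    cudaMalloc((void**)&dx, n * sizeof(float));
    cudaMalloc((void**)&dy, n * sizeof(float));
    cudaMemcpy(dA, A, n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);
    cudaDeviceSynchronize();              // warm-up call, not timed

    t0 = now_ms();
    for (int i = 0; i < iters; ++i)
        cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);
    cudaDeviceSynchronize();              // wait for all queued calls to finish
    double gpu_ms = (now_ms() - t0) / iters;

    // CSV line: size, CPU ms, GPU ms, CPU MFLOPS, GPU MFLOPS
    printf("%d,%f,%f,%f,%f\n", n, cpu_ms, gpu_ms,
           flop / (cpu_ms * 1e3), flop / (gpu_ms * 1e3));

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    free(A); free(x); free(y);
    return 0;
}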

Using this data, you can get a feel for what the overhead is. To pinpoint it exactly, repeat the same experiment but replace the inner shader code executed on the GPU with a no-op (simply copy the input to the output).

Hope this helps,

+8

You can get timings in nanoseconds from the device reporting when the event was queued, submitted, started, and finished, by using clGetEventProfilingInfo on your buffer transfer event.

More information, and how to set it up, is available here: http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html
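A small host-side sketch of that, assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and an already existing buffer and host pointer (the function and variable names are placeholders of mine):

#include <stdio.h>
#include <CL/cl.h>

/* 'queue' must have been created with CL_QUEUE_PROFILING_ENABLE,
   otherwise clGetEventProfilingInfo returns an error. */
void profile_write(cl_command_queue queue, cl_mem buffer,
                   const void *host_ptr, size_t bytes)
{
    cl_event ev;
    cl_ulong queued, submitted, started, ended;

    clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, bytes, host_ptr, 0, NULL, &ev);
    clWaitForEvents(1, &ev); /* make sure the transfer has actually finished */

    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_SUBMIT, sizeof(submitted), &submitted, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,  sizeof(started),  &started,  NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,    sizeof(ended),    &ended,    NULL);

    /* all values are device timestamps in nanoseconds */
    printf("queued->submit: %lu ns, submit->start: %lu ns, start->end: %lu ns\n",
           (unsigned long)(submitted - queued),
           (unsigned long)(started - submitted),
           (unsigned long)(ended - started));

    clReleaseEvent(ev);
}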

I think for 100x100 matrices you might be better off sticking to number crunching on the CPU. Unless you have lots of them to multiply at the same time, the benefit of the GPU will be barely noticeable due to the (small) transfer overhead and the usually much lower clock speeds. Make sure you tune your kernel to use as much local data as possible - on my hardware there is 32 KB per work group, and that should be plenty to hold two 100x100 matrices. The built-in dot-product functions should also be very handy.
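If you want to check the corresponding limits on your own device, a minimal query could look like the following (clGetDeviceInfo is the standard OpenCL call; the function name is mine):

#include <stdio.h>
#include <CL/cl.h>

/* Query how much local memory and how large a work group the device supports. */
void print_local_limits(cl_device_id device)
{
    cl_ulong local_mem = 0;
    size_t   max_wg    = 0;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
    printf("local memory: %lu KB, max work-group size: %zu\n",
           (unsigned long)(local_mem / 1024), max_wg);
}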

There was an amazing talk about this at AFDS last year (see sessionId: 2908): http://developer.amd.com/afds/pages/OLD/sessions.aspx They discuss optimizing the kernel in detail and hard-coding the optimal sizes.

+1

Are your matrices already on the GPU? If not, CUBLAS can transfer them for you (known as thunking), which is an additional cost.

Also, GPUs do not really shine for such small computations, i.e. they will probably be slower than the CPU since you have to transfer your result back. If you can, use bigger matrices. Otherwise you might want to use streams (cudaStream_t) to run multiple computations on the GPU in parallel, as sketched below.
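A hedged sketch of the streams idea, assuming the matrices and vectors are already packed contiguously on the device and that the work is issued through cuBLAS; the function name, the stream count and the round-robin assignment are my own choices:

#include <cuda_runtime.h>
#include <cublas_v2.h>

void batched_gemv_with_streams(cublasHandle_t handle, int n, int batch,
                               const float *dA,   // 'batch' matrices, each n*n, packed
                               const float *dx,   // 'batch' vectors, each n, packed
                               float *dy)         // 'batch' results, each n, packed
{
    const int kStreams = 4;                       // arbitrary choice
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s)
        cudaStreamCreate(&streams[s]);

    const float alpha = 1.0f, beta = 0.0f;
    for (int i = 0; i < batch; ++i) {
        cublasSetStream(handle, streams[i % kStreams]);   // round-robin over streams
        cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha,
                    dA + (size_t)i * n * n, n,
                    dx + (size_t)i * n, 1, &beta,
                    dy + (size_t)i * n, 1);
    }
    cudaDeviceSynchronize();                      // wait for every stream to finish

    for (int s = 0; s < kStreams; ++s)
        cudaStreamDestroy(streams[s]);
}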

If you want to measure the execution time of a kernel in CUDA, you need to enclose it (or anything else that computes on the GPU) in events, like this when using the CUDA runtime API:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

struct timeval cpuStart, cpuEnd;

cudaEventRecord(start);
gettimeofday(&cpuStart, 0); // get start time on CPU

// Do something with CUDA on the GPU, e.g. call kernels, transfer memory, ...

gettimeofday(&cpuEnd, 0); // get end time on CPU
double seconds = cpuEnd.tv_sec - cpuStart.tv_sec;
double microseconds = cpuEnd.tv_usec - cpuStart.tv_usec;
double cpuDuration = (seconds * 1.0e6 + microseconds) / 1.0e3; // in milliseconds

cudaEventRecord(stop);

// Wait until the stop event has occurred
cudaError_t eventResult;
do {
    eventResult = cudaEventQuery(stop);
} while (eventResult == cudaErrorNotReady);

// Assert there was no error; check the CUDA Toolkit Reference for further info
assert(cudaSuccess == eventResult); // requires #include <assert.h> or <cassert>

// Retrieve the elapsed time between the two events
float gpuDuration = 0.0f; // in milliseconds
cudaEventElapsedTime(&gpuDuration, start, stop);

// Release the event objects
cudaEventDestroy(stop);
cudaEventDestroy(start);

You might want to check the error code of every CUDA call (at least with an assert), as otherwise you may get errors from previous calls reported much later, which leads to hours of debugging...
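One common way to do that is a small checking macro along these lines (the macro name is mine, not part of CUDA):

#include <assert.h>
#include <stdio.h>
#include <cuda_runtime.h>

// Wrap every runtime call so a failure aborts at the call site instead of surfacing later.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            assert(err_ == cudaSuccess);                           \
        }                                                          \
    } while (0)

// Usage: CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));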

(Note: I mainly use the CUDA driver API, so this may not work. Sorry for that.)

EDIT: Just saw that you want to measure the call itself, not the duration of the kernel. You can do that simply by measuring the time on the CPU for the call - see the updated code above. This only works on Linux, though, because gettimeofday is not available on Windows (AFAIK).

+1

To find the call overhead, call a CUDA kernel that does as little as possible.

for (int i = 0; i < NLoops; i++) {
    gettimeofday(&cpuStart, 0); // get start time on CPU

    // Call minimal CUDA kernel

    gettimeofday(&cpuEnd, 0);   // get end time on CPU

    // save elapsed time
}

Follow Alex P.'s code above.

The less processing is done in the kernel, the more the time difference will consist purely of call overhead.

Do a little experimenting to find a good value for NLoops (maybe 1,000,000). Make sure the elapsed time is longer than your timer resolution, or you will end up with all zeros. If that happens, write some kernel code that executes for a fixed, predictable amount of time (n loops of x cycles each).

It is hard to exclude all the non-CUDA computation that may occur between cpuStart and cpuEnd (such as interrupt handling), but making several runs and averaging can give good results.
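For illustration, here is a self-contained variant of this idea under the CUDA runtime API; the kernel and variable names are mine, the whole loop is timed and divided by NLoops (rather than saving each iteration), and a synchronization after each launch is included so each measurement covers the complete round trip:

#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void noop_kernel(void) { /* intentionally empty */ }

int main(void) {
    const int NLoops = 100000;            // experiment with this, per the advice above
    struct timeval cpuStart, cpuEnd;

    noop_kernel<<<1, 1>>>();              // warm-up launch (context creation etc.)
    cudaDeviceSynchronize();

    gettimeofday(&cpuStart, 0);
    for (int i = 0; i < NLoops; ++i) {
        noop_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();          // include the wait, so each launch has completed
    }
    gettimeofday(&cpuEnd, 0);

    double us = (cpuEnd.tv_sec - cpuStart.tv_sec) * 1.0e6 +
                (cpuEnd.tv_usec - cpuStart.tv_usec);
    printf("average launch overhead: %f us\n", us / NLoops);
    return 0;
}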

+1

Source: https://habr.com/ru/post/1392657/

