How does the OpenCL command queue work, and what can I ask of it?

I am working on an algorithm that does almost the same operation a bunch of times. Since the operation consists of some linear algebra (BLAS), I am trying to use the GPU for this.

I wrote my kernel and started pushing kernels onto the command queue. Since I do not want to wait after each call, I figured I would chain my calls with events and just keep pushing them into the queue.

call kernel1(return event1)
call kernel2(wait for event1, return event2)
...
call kernel1000000(wait for event999999)

Now my question is: does all of this get pushed to the graphics chip, or does the driver keep the queue? Is there a bound on the number of events I can use, or on the length of the command queue? I looked around, but I could not find one.

I use atMonitor to check my GPU utilization, and it is pretty hard to push it above 20%. Could this simply be because I cannot issue the calls fast enough? My data is already stored on the GPU, and all I transfer is the actual calls.

2 answers

First, you should not wait on an event from the previous kernel unless the next kernel has a data dependency on it. Device utilization (usually) depends on there always being something ready to go in the queue. Only wait on an event when you actually need to.
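The effect of keeping the queue fed can be illustrated with a toy model. This is not OpenCL code and the timings are assumed values, not measurements; the function `utilization` and its parameters `enqueue_us`, `kernel_us`, and `wait_each` are hypothetical names introduced only for this sketch. It treats the queue as in-order, charges the host a fixed cost per submission, and optionally makes the host block until each kernel finishes.

```c
#include <assert.h>

/* Toy model of an in-order command queue (no real OpenCL calls).
 * n          : number of kernels submitted
 * enqueue_us : assumed host cost to submit one kernel, in microseconds
 * kernel_us  : assumed device time to run one kernel, in microseconds
 * wait_each  : nonzero if the host blocks until each kernel has finished
 * Returns the fraction of total wall time the device spends busy. */
double utilization(int n, double enqueue_us, double kernel_us, int wait_each)
{
    assert(n > 0 && enqueue_us >= 0.0 && kernel_us > 0.0);

    double host_time = 0.0;   /* time at which the host has submitted kernel i */
    double device_free = 0.0; /* time at which the device finishes its last kernel */

    for (int i = 0; i < n; ++i) {
        host_time += enqueue_us;
        /* A kernel starts once it has been submitted and the device is idle. */
        double start = host_time > device_free ? host_time : device_free;
        device_free = start + kernel_us;
        /* Blocking after every call stalls the host until the kernel is done. */
        if (wait_each)
            host_time = device_free;
    }
    return (n * kernel_us) / device_free;
}
```

Under these assumed numbers, `utilization(1000, 10.0, 50.0, 0)` is close to 1.00, while blocking after every call with `utilization(1000, 10.0, 50.0, 1)` drops to about 0.83. With very short kernels, `utilization(1000, 10.0, 2.0, 0)` is about 0.20 even without any waiting, because the device finishes each kernel before the host has submitted the next one. In this model, that is one way a device can sit near 20% when the host cannot issue calls fast enough.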

"does all of this get pushed to the graphics chip, or does the driver keep the queue?"

This is implementation-defined. Remember, OpenCL works on more than just GPUs! In terms of the CUDA-style device/host dichotomy, you should probably consider command queue operations (for most implementations) to happen on the "host".

Try queuing up several kernel calls without waiting in between them. Also, make sure that you are using an optimal work-group size. If you do both, you should be able to get the most out of your device.


Unfortunately, I do not know the answers to all of your questions, and you now have me wondering about the same things. I can say that I doubt the OpenCL queue will ever become full, since your GPU should finish executing the last queued command before at least 20 more commands are submitted. This is only true if your GPU has a watchdog, though, because the watchdog stops ridiculously long kernels (I think 5 seconds or more) from executing.


Source: https://habr.com/ru/post/894838/

