I am working on an algorithm that performs almost the same operation many times. Since the operation is mostly linear algebra (BLAS-style), I am trying to run it on the GPU.
I wrote my kernel and started enqueuing calls on the command queue. Since I do not want to block after each call, I thought I would associate my calls with events and just keep pushing them into the queue:
call kernel1 (return event1)
call kernel2 (wait for event1, return event2)
...
call kernel1000000 (wait for event999999)
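In OpenCL terms, the pattern above corresponds to passing each `clEnqueueNDRangeKernel` call an event wait list containing the previous kernel's event. Here is a minimal Python sketch of the idea I have in mind (a toy host-side model of an in-order command queue, not real OpenCL; all class and function names are made up for illustration):

```python
from collections import deque

class Event:
    """Toy completion marker for an enqueued command."""
    def __init__(self, name):
        self.name = name
        self.complete = False

class CommandQueue:
    """Toy in-order queue: enqueue never blocks, finish() drains everything."""
    def __init__(self):
        self.pending = deque()

    def enqueue_kernel(self, kernel, wait_for=None):
        # Enqueueing only records the command and its dependency;
        # nothing executes yet, so the host is free to keep pushing calls.
        event = Event(kernel.__name__)
        self.pending.append((kernel, wait_for, event))
        return event

    def finish(self):
        # The single blocking call: drain the queue in order,
        # honoring each command's wait list.
        results = []
        while self.pending:
            kernel, wait_for, event = self.pending.popleft()
            assert wait_for is None or wait_for.complete
            results.append(kernel())
            event.complete = True
        return results

def kernel1(): return "k1"
def kernel2(): return "k2"

queue = CommandQueue()
e1 = queue.enqueue_kernel(kernel1)               # returns event1
e2 = queue.enqueue_kernel(kernel2, wait_for=e1)  # waits for event1
print(queue.finish())                            # block once, at the end
```

The point of the sketch is the host-side behavior I am hoping for: enqueue is cheap and non-blocking, dependencies are expressed through events, and only one call at the very end waits for completion.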
Now my question is: does all of this live on the graphics chip, or does the driver keep the queue in host memory? Is there a limit on the number of events I can use, or on the length of the command queue? I looked around but could not find one.
I use atMonitor to check my GPU usage, and it is pretty hard to get it above 20%. Could this be because I cannot dispatch calls fast enough? My data is already stored on the GPU, and all I transfer are the actual calls.