OpenGL/OpenCL interop performance in glBindTexture(), glBegin()

I am working on an OS X application in a multi-GPU setup (late 2013 Mac Pro) that uses OpenCL (on the secondary GPU) to generate a texture that is later drawn to the screen using OpenGL (on the primary GPU). The application is CPU-bound because of calls to glBindTexture() and glBegin(), both of which spend essentially all of their time in:

_platform_memmove$VARIANT$Ivybridge 

which is part of the video driver:

 AMDRadeonX4000GLDriver 

Setup: creates an OpenGL texture (glPixelBuffer) and then its OpenCL counterpart (clPixelBuffer).

    cl_int clerror = 0;
    GLuint glPixelBuffer = 0;
    cl_mem clPixelBuffer = 0;

    glGenTextures(1, &glPixelBuffer);
    glBindTexture(GL_TEXTURE_2D, glPixelBuffer);
    glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 2048, 2048, 0, GL_RGBA, GL_FLOAT, NULL);
    glBindTexture(GL_TEXTURE_2D, 0);

    clPixelBuffer = clCreateFromGLTexture(_clShareGroupContext, CL_MEM_WRITE_ONLY, GL_TEXTURE_2D, 0, glPixelBuffer, &clerror);

Drawing code: maps the OpenGL texture onto the viewport. The entire NSOpenGLView is just this one texture.

    glClear(GL_COLOR_BUFFER_BIT);

    glBindTexture(GL_TEXTURE_2D, _glPixelBuffer); // <- spends cpu time here,
    glBegin(GL_QUADS);                            // <- and here
    glTexCoord2f(0., 0.); glVertex3f(-1.f,  1.f, 0.f);
    glTexCoord2f(0., hr); glVertex3f(-1.f, -1.f, 0.f);
    glTexCoord2f(wr, hr); glVertex3f( 1.f, -1.f, 0.f);
    glTexCoord2f(wr, 0.); glVertex3f( 1.f,  1.f, 0.f);
    glEnd();
    glBindTexture(GL_TEXTURE_2D, 0);

    glFlush();

After acquiring control of the texture memory (via clEnqueueAcquireGLObjects()), the OpenCL kernel writes data to the texture and then releases control (via clEnqueueReleaseGLObjects()). The texture data should never exist in main memory (if I understand everything correctly).

My question is: is it expected that so much CPU time is spent in memmove()? Does it indicate a problem in my code? Or possibly a driver bug? My (unsubstantiated) suspicion is that the texture data is moving GPUx → CPU/RAM → GPUy, which I would like to avoid.

1 answer

Before I get to the memory transfer, my first observation is that you are using glBegin(), which is not going to be your best friend, because:

1) Immediate-mode drawing does not work very well with the driver. Use VBOs etc. instead, so that this data can live on the GPU.

2) On OS X, it means you are in the old compatibility context rather than the newer core profile context. Since (as I understand it) the core context is a complete rewrite, that is where future optimizations will end up, while the context you are using is (probably) only being maintained.
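To make point 1 concrete, here is a minimal sketch of the quad's vertex data laid out for a VBO instead of being issued through glBegin(). The QuadVertex struct and fill_quad_vertices() are hypothetical names, not taken from the question; the corner order and the wr/hr texture extents are copied from the drawing code above. The GL upload and draw calls appear only as comments because they need a live context.

```c
#include <assert.h>
#include <stddef.h>

/* One vertex of the textured quad: position (x, y, z), then texcoord (s, t). */
typedef struct {
    float x, y, z;
    float s, t;
} QuadVertex;

/* Fills out[0..3] with the same four corners, in the same order, as the
 * glTexCoord2f/glVertex3f pairs in the question's drawing code.
 * wr and hr are the texture-coordinate extents used there. */
void fill_quad_vertices(QuadVertex out[4], float wr, float hr)
{
    assert(out != NULL);
    out[0] = (QuadVertex){ -1.f,  1.f, 0.f, 0.f, 0.f };
    out[1] = (QuadVertex){ -1.f, -1.f, 0.f, 0.f, hr  };
    out[2] = (QuadVertex){  1.f, -1.f, 0.f, wr,  hr  };
    out[3] = (QuadVertex){  1.f,  1.f, 0.f, wr,  0.f };
}

/* With a current GL context, the array would be uploaded once at setup,
 * for example:
 *   glGenBuffers(1, &vbo);
 *   glBindBuffer(GL_ARRAY_BUFFER, vbo);
 *   glBufferData(GL_ARRAY_BUFFER, 4 * sizeof(QuadVertex), verts, GL_STATIC_DRAW);
 * and each frame would draw it with glDrawArrays(GL_TRIANGLE_FAN, 0, 4),
 * since GL_QUADS is not available in a core profile. */
```

The point of the layout is that the four vertices are sent to the GPU once, so the per-frame host work shrinks to a bind and a single draw call.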

So, on to the memory transfer. On the GL side, are you using glCreateSyncFromCLeventARB() and glWaitSync() for this? The glFlush() I see in your code should not be necessary. Once you get rid of immediate-mode drawing (as mentioned above) and use sync objects between the two APIs, your host code should be doing nothing (other than asking the driver to tell the GPU to do things). That gives you the best chance of a fast buffer copy.

Yes, a copy :( Since your CL texture physically lives in a different GPU's memory from the GL texture, there has to be a copy across the PCIe bus, which will be slow(er). That is what you are seeing in your profile. What is actually happening is that the CPU maps GPU A's memory and GPU B's memory into host-allocated memory and then copies between them (hopefully) with DMA. I doubt the data actually touches system memory, so the path is GPUx → GPUy.
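To get a feel for how much data that copy involves, here is a rough sizing sketch. The helper names texture_bytes() and transfer_ms() are hypothetical, and the bandwidth passed to transfer_ms() is an assumed figure, not a measurement. The bytes-per-pixel value is also an assumption: the question's glTexImage2D call uses the unsized GL_RGBA internal format, which leaves the actual storage precision to the driver (GL_FLOAT there only describes the client data, which is NULL).

```c
#include <assert.h>
#include <stdint.h>

/* Bytes occupied by one mip level of a width x height texture. */
uint64_t texture_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Milliseconds needed to move `bytes` at `bytes_per_second`
 * (an assumed sustained rate, not a measured one). */
double transfer_ms(uint64_t bytes, double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    return 1000.0 * (double)bytes / bytes_per_second;
}
```

For the 2048 x 2048 texture in the question, texture_bytes(2048, 2048, 4) gives 16 MiB if the driver stores 8-bit RGBA, and texture_bytes(2048, 2048, 16) gives 64 MiB if it stores 32-bit float RGBA. At an assumed 8 GB/s, the 64 MiB case works out to roughly 8.4 ms per frame, which is about half of a 60 fps frame budget before any drawing happens.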

Try putting the CL and GL contexts on the same GPU, and I think you will see your transfer time disappear.

Final thought: if your CL compute time is dwarfed by the transfer time, it is probably best to keep both contexts on the same GPU. You have the classic CPU/GPU task-partitioning problem.


Source: https://habr.com/ru/post/1203904/

