Memory corruption with clEnqueueWriteBuffer - OpenCL

I am working on some code that sends large amounts of data from the host to the device, and it behaves erratically.

In the following code fragment, I try to send an array from the host to the device. The size of the array increases at each iteration, gradually increasing the amount of memory sent to the device. The first element of the array is filled with a nonzero value, which is then read from inside the kernel and printed to the console. The value should be the same whether read from the host or the device, but in some iterations it is not.

Here is the code:

    int SizeArray = 0;
    for (int j = 1; j < 100; j++) {
        // Array memory allocation, starting with 4 MB in the first
        // iteration and growing to almost 400 MB in the last one
        SizeArray = j * 1000000 * sizeof(float);
        Array = (float*)malloc(SizeArray);
        memset(Array, 0, SizeArray);

        // Give the first array element a nonzero value;
        // this is the value the kernel is expected to print
        Array[0] = j;

        memArray = clCreateBuffer(context, CL_MEM_READ_WRITE, SizeArray, NULL, &ret);

        // Write the array contents into the buffer on the device
        ret = clEnqueueWriteBuffer(command_queue, memArray, CL_TRUE, 0,
                                   SizeArray, Array, 0, NULL, NULL);
        ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memArray);

        getchar();

        // Execute the kernel, which prints the first element of the array
        ret = clEnqueueNDRangeKernel(command_queue, kernel, 3, NULL,
                                     mGlobalWorkSizePtr, mLocalWorkSizePtr,
                                     0, NULL, NULL);
        ret = clFinish(command_queue);

        /****** FAIL! The kernel prints the correct first element ONLY IN SOME
           ITERATIONS (when it fails, zeros are printed), depending on
           SizeArray ?? ******/

        free(Array);
        ret = clReleaseMemObject(memArray);
    }
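One thing worth noting about the code above: every `ret` is overwritten without ever being inspected. A minimal fail-fast check is cheap to add; the `CHECK` macro below and its message format are my additions, not part of the original code:

```c
/* Sketch of a fail-fast error check for each OpenCL call.
   CL_OK mirrors the numeric value of CL_SUCCESS in CL/cl.h. */
#include <stdio.h>
#include <stdlib.h>

#define CL_OK 0

#define CHECK(err, what)                                            \
    do {                                                            \
        if ((err) != CL_OK) {                                       \
            fprintf(stderr, "%s failed with code %d\n",             \
                    (what), (int)(err));                            \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

/* Usage inside the loop above:
     ret = clEnqueueWriteBuffer(command_queue, memArray, CL_TRUE, 0,
                                SizeArray, Array, 0, NULL, NULL);
     CHECK(ret, "clEnqueueWriteBuffer");
*/
```

With a check after every call, a failing `clCreateBuffer` or `clEnqueueNDRangeKernel` shows up immediately instead of silently producing zeros.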

The device on which this code was tested has the following characteristics:

  • Name: Intel(R) HD Graphics 4000
  • DeviceVersion: OpenCL 1.1
  • DriverVersion: 8.15.10.2696
  • MaxMemoryAllocationSize: 425721856
  • GlobalMemoryCacheSize: 2097152
  • GlobalMemorySize: 1702887424
  • MaxConstantBufferSize: 65536
  • LocalMemorySize: 655
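The loop's largest request (j = 99, i.e. 396,000,000 bytes) stays just below the MaxMemoryAllocationSize reported above, so an over-size allocation alone does not explain the failures, but checking the limit at run time is cheap insurance. A minimal sketch; the helper names and the hard-coded limit are mine:

```c
/* Sanity check against the device's maximum single-allocation size.
   MAX_ALLOC hard-codes the value this device reports; real code would
   query it (see the comment at the bottom). */
#include <stddef.h>

#define MAX_ALLOC 425721856UL  /* MaxMemoryAllocationSize reported above */

/* Same size formula as the loop in the question. */
static unsigned long size_for_iteration(int j) {
    return (unsigned long)j * 1000000UL * sizeof(float);
}

static int within_device_limit(int j) {
    return size_for_iteration(j) <= MAX_ALLOC;
}

/* In real code the limit would be queried at run time:
     cl_ulong max_alloc;
     clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                     sizeof(max_alloc), &max_alloc, NULL);
*/
```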

Whether the kernel prints incorrect values depends on the size of the buffer sent to the device.

Here's the output:

    Array GPU: 1.000000
    Array GPU: 2.000000
    Array GPU: 3.000000
    Array GPU: 4.000000
    Array GPU: 5.000000
    Array GPU: 6.000000
    Array GPU: 7.000000
    Array GPU: 8.000000
    Array GPU: 9.000000
    Array GPU: 10.000000
    Array GPU: 11.000000
    Array GPU: 12.000000
    Array GPU: 13.000000
    Array GPU: 14.000000
    Array GPU: 15.000000
    Array GPU: 16.000000
    Array GPU: 17.000000
    Array GPU: 18.000000
    Array GPU: 19.000000
    Array GPU: 20.000000
    Array GPU: 21.000000
    Array GPU: 22.000000
    Array GPU: 23.000000
    Array GPU: 24.000000
    Array GPU: 25.000000
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 34.000000
    Array GPU: 35.000000
    Array GPU: 36.000000
    Array GPU: 37.000000
    Array GPU: 38.000000
    Array GPU: 39.000000
    Array GPU: 40.000000
    Array GPU: 41.000000
    Array GPU: 42.000000
    Array GPU: 43.000000
    Array GPU: 44.000000
    Array GPU: 45.000000
    Array GPU: 46.000000
    Array GPU: 47.000000
    Array GPU: 48.000000
    Array GPU: 49.000000
    Array GPU: 50.000000
    Array GPU: 51.000000
    Array GPU: 52.000000
    Array GPU: 53.000000
    Array GPU: 54.000000
    Array GPU: 55.000000
    Array GPU: 56.000000
    Array GPU: 57.000000
    Array GPU: 58.000000
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 0.000000 <-------- INCORRECT VALUE, kernel is receiving corrupted memory
    Array GPU: 68.000000
    Array GPU: 69.000000
    ...

As you can see, the device receives invalid values with no visible pattern, and the clEnqueueWriteBuffer function never returns an error code.

To summarize: a memory block is sent to the kernel, but the kernel receives zeroed memory, depending on the total size of the block sent.
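One way to narrow down where the corruption happens is to read the buffer straight back on the host right after the blocking write, before the kernel runs at all. If the read-back value is already zero, the kernel is innocent. A sketch, assuming the same variables as the code above; the helper function and its name are my additions:

```c
/* Compare the first element read back from the device against the
   value we just wrote. */
#include <stdio.h>

static int first_element_matches(const float *readback, float expected) {
    return readback[0] == expected;
}

/* Usage right after the clEnqueueWriteBuffer call in the loop:
     float check = 0.0f;
     ret = clEnqueueReadBuffer(command_queue, memArray, CL_TRUE, 0,
                               sizeof(float), &check, 0, NULL, NULL);
     if (!first_element_matches(&check, (float)j))
         printf("iteration %d: wrote %f, read back %f\n",
                j, (float)j, check);
*/
```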

The same code behaves differently on different computers (incorrect values appear in different iterations).

How can memory corruption be avoided? Did I miss something?

Thanks in advance.


Here's the full working code:


Edit: After some tests, I need to clarify that the problem is not printf. The problem seems to be in transferring the data to the device, before the kernel even executes.

Here is the code without the kernel execution; the results are still inconsistent.

1 answer

Have you tried

  CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR

since your GPU uses the same memory as the CPU? For your iGPU, the buffer would then live in the same place as host memory.
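A minimal sketch of that suggestion, assuming the same `context` and `command_queue` variables as the question; the helper function and its name are my additions, not code from the original post. Instead of a separate host array plus `clEnqueueWriteBuffer`, the buffer is allocated with `CL_MEM_ALLOC_HOST_PTR` and filled through a mapped pointer:

```c
#include <string.h>
#include <CL/cl.h>

/* Hypothetical helper: create a host-visible buffer and fill it in place
   via map/unmap, avoiding the separate host-to-device copy.
   Error handling is elided for brevity. */
static cl_mem create_and_fill_mapped(cl_context context,
                                     cl_command_queue queue,
                                     size_t size, float first, cl_int *ret)
{
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, ret);

    /* Blocking map for writing, then fill directly in place. */
    float *mapped = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                                CL_MAP_WRITE, 0, size,
                                                0, NULL, NULL, ret);
    memset(mapped, 0, size);
    mapped[0] = first;  /* same test value as in the question */

    *ret = clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);
    return buf;
}
```

On an integrated GPU this can be effectively zero-copy, which both avoids the suspect transfer path and halves the memory traffic.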

Create several buffers and stress-test them. If they all receive invalid values, install a different driver version, preferably a newer one; if that doesn't solve it, RMA your card.

If only one buffer is wrong, it is probably a plain VRAM error: mark that buffer as unusable, create new buffers as needed, and avoid it (though I'm not sure whether the driver moves buffers around in the background). If every single buffer fails, the memory cells themselves may be damaged.


Source: https://habr.com/ru/post/1202783/

