OpenCL Image Histogram

I am trying to write a histogram kernel in OpenCL to compute 256-bin R, G, and B histograms of an RGBA32F input image. My kernel looks like this:

const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP |
                           CLK_FILTER_NEAREST;

__kernel void computeHistogram(read_only image2d_t input,
                               __global int* rOutput,
                               __global int* gOutput,
                               __global int* bOutput)
{
    int2 coords = {get_global_id(0), get_global_id(1)};
    float4 sample = read_imagef(input, mSampler, coords);

    uchar rbin = floor(sample.x * 255.0f);
    uchar gbin = floor(sample.y * 255.0f);
    uchar bbin = floor(sample.z * 255.0f);

    rOutput[rbin]++;
    gOutput[gbin]++;
    bOutput[bbin]++;
}

When I run it on a 2100 x 894 (1,877,400 pixel) image, I tend to see only around 1,870,000 total values when I sum up the histogram bins for each channel. It is also a different total each time. I expected this, since every so often two kernel instances probably fetch the same value from an output array and increment it, effectively cancelling out one increment operation (I suppose?).

The 1,870,000 total is for a workgroup size of {1,1} (which, apparently, is the default unless I specify otherwise). If I force a larger workgroup size, such as {10,6}, I get a drastically smaller total in my histogram (proportional to the change in workgroup size). This struck me as strange, but I am guessing what happens is that all of the work-items in the group increment the same output array value simultaneously, so it just counts as a single increment?

In any case, I have read in the specification that OpenCL has no global memory synchronization, only synchronization within local workgroups using their __local memory. The NVIDIA histogram example breaks the histogram workload up into a bunch of fixed-size sub-problems, computes their partial histograms, then merges the results into a single histogram afterward. This doesn't seem like it would work nicely for images of arbitrary size. I suppose I could pad the image data with dummy values...

Being new to OpenCL, I guess I'm wondering if there is a more straightforward way to do this (since it seems like it should be a relatively straightforward GPGPU problem).

Thanks!

4 answers

As stated earlier, you are writing to shared memory unsynchronized and non-atomically. This leads to errors. If the picture is big enough, I have a suggestion:

Split the work into one-dimensional workgroups over columns or rows. Have each kernel instance sum the histogram for its column or row in private memory, then add the result to the global histogram with atom_inc. This keeps most of the summing in private memory, which is much faster, and reduces the number of atomic operations.

If you want to work in two dimensions, you can do the same thing on tiles of the image.
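A minimal sketch of that row-wise idea, assuming one work-item per row (the kernel name, the explicit width argument, and the use of atomic_add to fold each whole private bin count into the global bin in one operation, rather than repeated atom_inc calls, are my own choices, not anything from the original code):

const sampler_t rowSampler = CLK_NORMALIZED_COORDS_FALSE |
                             CLK_ADDRESS_CLAMP |
                             CLK_FILTER_NEAREST;

// One work-item per image row: accumulate privately, merge once at the end.
__kernel void rowHistogram(read_only image2d_t input,
                           int width,
                           __global int* rOutput,
                           __global int* gOutput,
                           __global int* bOutput)
{
    int y = get_global_id(0);
    int rHist[256] = {0};  // private partial histograms; arrays this large
    int gHist[256] = {0};  // may spill on some GPUs, so doing one channel
    int bHist[256] = {0};  // per pass is a fallback if this runs slowly

    for (int x = 0; x < width; x++) {
        float4 s = read_imagef(input, rowSampler, (int2)(x, y));
        rHist[(int)clamp(s.x * 255.0f, 0.0f, 255.0f)]++;
        gHist[(int)clamp(s.y * 255.0f, 0.0f, 255.0f)]++;
        bHist[(int)clamp(s.z * 255.0f, 0.0f, 255.0f)]++;
    }

    // Merge: atomic_add (OpenCL 1.1; atom_add behind the
    // cl_khr_global_int32_base_atomics extension on 1.0) keeps
    // concurrent updates from different rows safe.
    for (int bin = 0; bin < 256; bin++) {
        if (rHist[bin]) atomic_add(&rOutput[bin], rHist[bin]);
        if (gHist[bin]) atomic_add(&gOutput[bin], gHist[bin]);
        if (bHist[bin]) atomic_add(&bOutput[bin], bHist[bin]);
    }
}

Enqueued as a one-dimensional range of height work-items, this performs at most 3 x 256 atomic operations per row instead of three per pixel.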

[EDIT:]

I think I have a better answer. ;)

Have a look at: http://developer.download.nvidia.com/compute/opencl/sdk/website/samples.html#oclHistogram

They have an interesting implementation there...
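The gist of that sample, as a rough sketch rather than its actual code (the kernel names are mine, only the red channel is shown for brevity, and the image dimensions are assumed to be multiples of the workgroup size): each workgroup histograms its tile into __local memory, writes the result out as a partial histogram, and a second kernel sums the partials, so the per-pixel work needs no global atomics at all.

const sampler_t tileSampler = CLK_NORMALIZED_COORDS_FALSE |
                              CLK_ADDRESS_CLAMP |
                              CLK_FILTER_NEAREST;

// Stage 1: per-workgroup partial histogram in __local memory.
__kernel void partialHistogram(read_only image2d_t input,
                               __global int* partials)  // numGroups * 256 ints
{
    __local int localHist[256];
    int lid = get_local_id(1) * get_local_size(0) + get_local_id(0);
    int localSize = get_local_size(0) * get_local_size(1);
    int group = get_group_id(1) * get_num_groups(0) + get_group_id(0);

    for (int i = lid; i < 256; i += localSize)
        localHist[i] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    float4 s = read_imagef(input, tileSampler,
                           (int2)(get_global_id(0), get_global_id(1)));
    atomic_inc(&localHist[(int)(s.x * 255.0f)]);  // local atomics are far
                                                  // cheaper than global ones
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = lid; i < 256; i += localSize)
        partials[group * 256 + i] = localHist[i];
}

// Stage 2: one work-item per bin (global size 256) sums the partials.
__kernel void mergeHistograms(__global const int* partials,
                              __global int* histogram,
                              int numGroups)
{
    int bin = get_global_id(0);
    int sum = 0;
    for (int g = 0; g < numGroups; g++)
        sum += partials[g * 256 + bin];
    histogram[bin] = sum;
}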


Yes, you are writing to shared memory from many work-items at the same time, so you will lose updates unless you do them in a safe fashion (or worse? just don't do it). Increasing the group size actually increases the utilization of your compute device, which in turn increases the likelihood of conflicts, so you lose more updates.

However, you seem to be confusing synchronization (ordering the execution of threads) with shared memory updates (which usually require either atomic operations, or code synchronization plus memory barriers, to make sure memory updates are visible to the other, synchronized threads).

Synchronization + barriers are not that useful for your case (and, as you noted, are not available across workgroups anyway. That is by design: two thread groups may never run concurrently, so trying to synchronize them would be nonsensical). They are typically used when all threads first work on generating a common data set, and then all start consuming that data set with a different access pattern.

In your case, you can use atomic operations (e.g. atom_inc, see http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=113&Itemid=168 ). However, note that updating a heavily contended memory address (say, because you have thousands of threads trying to write to just 256 ints) is likely to yield poor performance. All the hoops typical histogram code jumps through are there to reduce contention on the histogram data.
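For reference, the smallest change that makes the kernel from the question correct is swapping the plain increments for atomic ones; a sketch assuming OpenCL 1.1's built-in atomic_inc (atom_inc behind the extension pragmas on 1.0), with exactly the contention cost described above:

// mSampler as defined in the question.
__kernel void computeHistogram(read_only image2d_t input,
                               __global int* rOutput,
                               __global int* gOutput,
                               __global int* bOutput)
{
    int2 coords = (int2)(get_global_id(0), get_global_id(1));
    float4 sample = read_imagef(input, mSampler, coords);

    // atomic_inc makes each read-modify-write indivisible, so no
    // increments are lost when work-items hit the same bin.
    atomic_inc(&rOutput[(int)(sample.x * 255.0f)]);
    atomic_inc(&gOutput[(int)(sample.y * 255.0f)]);
    atomic_inc(&bOutput[(int)(sample.z * 255.0f)]);
}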


you can check



