I am trying to write a histogram kernel in OpenCL to compute 256-bin R, G, and B histograms of an RGBA32F input image. My kernel looks like this:
const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP |
                           CLK_FILTER_NEAREST;

__kernel void computeHistogram(read_only image2d_t input,
                               __global int* rOutput,
                               __global int* gOutput,
                               __global int* bOutput)
{
    int2 coords = {get_global_id(0), get_global_id(1)};
    float4 sample = read_imagef(input, mSampler, coords);

    uchar rbin = floor(sample.x * 255.0f);
    uchar gbin = floor(sample.y * 255.0f);
    uchar bbin = floor(sample.z * 255.0f);

    rOutput[rbin]++;
    gOutput[gbin]++;
    bOutput[bbin]++;
}
When I run it on a 2100 x 894 image (1,877,400 pixels), I tend to see only about 1,870,000 total values when I sum up the histogram bins for each channel, and it's a different total each time. I more or less expected this, since every now and then two work-items presumably read the same value from the output array, increment it, and write it back, effectively cancelling out one of the increments (I suppose?).
The 1,870,000 total is with a workgroup size of {1,1} (which apparently is what gets used by default unless I specify otherwise). If I force a larger workgroup size, say {10, 6}, I get a drastically lower total in the histogram (roughly in proportion to the change in workgroup size). That struck me as strange, but I assume what happens is that all the work-items in a group increment the output array value at the same time, so it just counts as a single increment?
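For context, this is roughly how I'm forcing that workgroup size on the host (a fragment from my own setup, so queue and kernel are assumed to already exist; error checking omitted):

// Global size covers the whole 2100 x 894 image; local size is the
// workgroup size I'm forcing. Passing NULL instead of localWorkSize
// lets the implementation pick one itself (which seems to be where the
// {1,1} comes from on my machine).
size_t globalWorkSize[2] = { 2100, 894 };
size_t localWorkSize[2]  = { 10, 6 };   // must evenly divide the global size

cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    2,               // work_dim
                                    NULL,            // no global offset
                                    globalWorkSize,
                                    localWorkSize,   // or NULL for the default
                                    0, NULL, NULL);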
In any case, I've read in the specification that OpenCL has no global memory synchronization, only synchronization within local workgroups using their __local memory. The histogram example from NVIDIA breaks the histogram workload up into a bunch of fixed-size subproblems, computes their partial histograms, and then merges the results into one histogram afterwards. That doesn't seem like it would work nicely for arbitrarily sized images, though. I suppose I could pad the image data out with dummy values...
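For reference, here is roughly how I picture one of those per-workgroup partial-histogram kernels (just a sketch of my understanding, not code from the NVIDIA sample; the local array size, the barriers, and the final atomic merge are my own assumptions, and it reuses the mSampler declared above):

// Sketch: each workgroup builds a partial histogram in __local memory,
// then merges it into the global histogram once at the end. Only the
// red channel is shown for brevity. Assumes OpenCL 1.1 atomics.
__kernel void partialHistogram(read_only image2d_t input,
                               __global int* rOutput)
{
    __local int localHist[256];

    int lid = get_local_id(1) * get_local_size(0) + get_local_id(0);
    int localSize = get_local_size(0) * get_local_size(1);

    // Zero the local histogram cooperatively.
    for (int bin = lid; bin < 256; bin += localSize)
        localHist[bin] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    int2 coords = (int2)(get_global_id(0), get_global_id(1));
    float4 sample = read_imagef(input, mSampler, coords);
    int rbin = (int)floor(sample.x * 255.0f);

    // Contention here is only between work-items of this workgroup.
    atomic_inc(&localHist[rbin]);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Merge this workgroup's partial histogram into the global result.
    for (int bin = lid; bin < 256; bin += localSize)
        atomic_add(&rOutput[bin], localHist[bin]);
}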
Being new to OpenCL, I guess I'm really just wondering whether there's an easier way to do this (since it seems like it ought to be a relatively simple GPGPU problem).
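For instance, would plain global atomics already be enough? This is the variant I've been meaning to try (again just a sketch; it assumes OpenCL 1.1 atomics, or the cl_khr_global_int32_base_atomics extension on 1.0, and reuses the mSampler from above):

// Sketch: same kernel as above, but with atomic read-modify-write so
// that concurrent work-items can't overwrite each other's increments.
__kernel void computeHistogramAtomic(read_only image2d_t input,
                                     __global int* rOutput,
                                     __global int* gOutput,
                                     __global int* bOutput)
{
    int2 coords = (int2)(get_global_id(0), get_global_id(1));
    float4 sample = read_imagef(input, mSampler, coords);

    // Map [0.0, 1.0] sample values to bin indices 0..255.
    int rbin = (int)floor(sample.x * 255.0f);
    int gbin = (int)floor(sample.y * 255.0f);
    int bbin = (int)floor(sample.z * 255.0f);

    atomic_inc(&rOutput[rbin]);
    atomic_inc(&gOutput[gbin]);
    atomic_inc(&bOutput[bbin]);
}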
Thanks!