Finding the minimum and maximum values in a Metal texture

I have an MTLTexture containing 16-bit unsigned integers (MTLPixelFormatR16Uint). The values range from roughly 7,000 to 20,000, with 0 used as the "nodata" value, which is why it is skipped in the code below. I would like to find the minimum and maximum values so that I can scale the data into the range 0-255. Ultimately I will probably want to base the minimum and maximum on a histogram of the data (it has some outliers), but for now I'm stuck on a simple min/max extraction.

I can read the data back from the GPU to the CPU and pull out the min/max values there, but I would prefer to do this work on the GPU.
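For reference, the CPU-side version I have working looks roughly like the sketch below (Swift; `texture` is the MTLTexture described above and is assumed to be CPU-accessible, and the scaling at the end is the eventual goal rather than part of the extraction):

    // CPU fallback sketch: copy the R16Uint texture into memory and scan it,
    // skipping the 0 "nodata" value. `texture` is the MTLTexture from above.
    var pixels = [UInt16](repeating: 0, count: texture.width * texture.height)
    let region = MTLRegionMake2D(0, 0, texture.width, texture.height)
    pixels.withUnsafeMutableBytes { ptr in
        texture.getBytes(ptr.baseAddress!,
                         bytesPerRow: texture.width * MemoryLayout<UInt16>.stride,
                         from: region,
                         mipmapLevel: 0)
    }
    let valid = pixels.filter { $0 != 0 }
    let minValue = valid.min() ?? 0
    let maxValue = valid.max() ?? 0
    // Eventual scaling to 0-255: (v - minValue) * 255 / (maxValue - minValue)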

First attempt

A compute command encoder is dispatched with 16x16 threads per threadgroup, and the number of threadgroups is based on the texture size (e.g. width = textureWidth / 16, height = textureHeight / 16).
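The host-side setup looks roughly like this sketch (Swift; `device`, `queue`, `pipeline`, and `texture` are my objects and not shown in full, and seeding the output with min = UInt32.max, max = 0 is an assumption about how the buffer starts out):

    // Host-side dispatch sketch. `device`, `queue`, `pipeline`, and `texture`
    // come from my app; the seed values are an assumption about initialization.
    var seed: (min: UInt32, max: UInt32) = (min: .max, max: 0)
    let outBuffer = device.makeBuffer(bytes: &seed,
                                      length: MemoryLayout.size(ofValue: seed),
                                      options: .storageModeShared)!

    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)            // the minMax kernel
    encoder.setTexture(texture, index: 0)
    encoder.setBuffer(outBuffer, offset: 0, index: 0)

    let threadsPerThreadgroup = MTLSize(width: 16, height: 16, depth: 1)
    let threadgroupsPerGrid = MTLSize(width: texture.width / 16,
                                      height: texture.height / 16,
                                      depth: 1)
    encoder.dispatchThreadgroups(threadgroupsPerGrid,
                                 threadsPerThreadgroup: threadsPerThreadgroup)
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()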

    typedef struct {
        atomic_uint min;
        atomic_uint max;
    } BandMinMax;

    kernel void minMax(texture2d<ushort, access::read> band1 [[texture(0)]],
                       device BandMinMax &out [[buffer(0)]],
                       uint2 gid [[thread_position_in_grid]])
    {
        ushort value = band1.read(gid).r;

        if (value != 0) {
            uint currentMin = atomic_load_explicit(&out.min, memory_order_relaxed);
            uint currentMax = atomic_load_explicit(&out.max, memory_order_relaxed);

            if (value > currentMax) {
                atomic_store_explicit(&out.max, value, memory_order_relaxed);
            }
            if (value < currentMin) {
                atomic_store_explicit(&out.min, value, memory_order_relaxed);
            }
        }
    }

From this I do get a minimum and a maximum value, but for the same dataset the min and max often come back different from run to run. Understandably, this ends up being the min and max as seen by a single thread while many threads are running, since nothing keeps the load and the store atomic as a pair.

Second attempt

Building on the previous attempt, this time I store an individual min/max pair for each thread in the threadgroup, all 256 of them (16x16).

    kernel void minMax(texture2d<ushort, access::read> band1 [[texture(0)]],
                       device BandMinMax *out [[buffer(0)]],
                       uint2 gid [[thread_position_in_grid]],
                       uint tid [[thread_index_in_threadgroup]])
    {
        ushort value = band1.read(gid).r;

        if (value != 0) {
            uint currentMin = atomic_load_explicit(&out[tid].min, memory_order_relaxed);
            uint currentMax = atomic_load_explicit(&out[tid].max, memory_order_relaxed);

            if (value > currentMax) {
                atomic_store_explicit(&out[tid].max, value, memory_order_relaxed);
            }
            if (value < currentMin) {
                atomic_store_explicit(&out[tid].min, value, memory_order_relaxed);
            }
        }
    }

This returns an array containing 256 sets of min/max values. From those I suppose I could then find the smallest of the minimums and the largest of the maximums, but that seems like a poor approach. Would appreciate a pointer in the right direction, thanks!

1 answer

The Metal Shading Language has atomic compare-and-exchange functions that compare the value at a memory location against an expected value and, if the two are equal, replace it with a new value (otherwise they hand back the value that was actually there). With these you can build a pair of atomic compare-and-replace-if-[greater|less]-than operations:

    // Replace *current with candidate if candidate is smaller, or if *current
    // is still 0 (treated as "not yet written"). Retries until the CAS
    // succeeds or the condition no longer holds.
    static void atomic_uint_exchange_if_less_than(volatile device atomic_uint *current, uint candidate)
    {
        uint val;
        do {
            val = *((device uint *)current);
        } while ((candidate < val || val == 0) &&
                 !atomic_compare_exchange_weak_explicit(current, &val, candidate,
                                                        memory_order_relaxed, memory_order_relaxed));
    }

    // Replace *current with candidate if candidate is larger.
    static void atomic_uint_exchange_if_greater_than(volatile device atomic_uint *current, uint candidate)
    {
        uint val;
        do {
            val = *((device uint *)current);
        } while (candidate > val &&
                 !atomic_compare_exchange_weak_explicit(current, &val, candidate,
                                                        memory_order_relaxed, memory_order_relaxed));
    }

To apply these, you can create a buffer containing one interleaved min, max pair for each threadgroup. Then, in the kernel function, read the texture and conditionally write the min and max values:

    kernel void min_max_per_threadgroup(texture2d<ushort, access::read> texture [[texture(0)]],
                                        device uint *mapBuffer [[buffer(0)]],
                                        uint2 tpig [[thread_position_in_grid]],
                                        uint2 tgpig [[threadgroup_position_in_grid]],
                                        uint2 tgpg [[threadgroups_per_grid]])
    {
        ushort val = texture.read(tpig).r;
        device atomic_uint *atomicBuffer = (device atomic_uint *)mapBuffer;
        // Each threadgroup owns a pair of slots in mapBuffer: [min, max].
        atomic_uint_exchange_if_less_than(atomicBuffer + ((tgpig[1] * tgpg[0] + tgpig[0]) * 2), val);
        atomic_uint_exchange_if_greater_than(atomicBuffer + ((tgpig[1] * tgpg[0] + tgpig[0]) * 2) + 1, val);
    }
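On the host side (not shown in the original), the per-threadgroup buffer and the first dispatch might be set up roughly like the Swift sketch below, where `device`, `queue`, `texture`, and the pipeline state names are assumed to exist. A zero-filled buffer works as the starting state, because the exchange functions above treat a stored 0 as "not yet written" for the min slot, and any nonzero value beats 0 for the max slot:

    // Host-side sketch (Swift): one interleaved (min, max) pair of 32-bit uints
    // per threadgroup. makeBuffer(length:options:) returns zero-filled storage.
    let threadsPerThreadgroup = MTLSize(width: 16, height: 16, depth: 1)
    let threadgroupsPerGrid = MTLSize(width: texture.width / 16,
                                      height: texture.height / 16,
                                      depth: 1)
    let threadgroupCount = threadgroupsPerGrid.width * threadgroupsPerGrid.height

    let mapBuffer = device.makeBuffer(length: threadgroupCount * 2 * MemoryLayout<UInt32>.stride,
                                      options: .storageModeShared)!

    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(minMaxPerThreadgroupState)   // assumed pipeline name
    encoder.setTexture(texture, index: 0)
    encoder.setBuffer(mapBuffer, offset: 0, index: 0)
    encoder.dispatchThreadgroups(threadgroupsPerGrid,
                                 threadsPerThreadgroup: threadsPerThreadgroup)
    encoder.endEncoding()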

Finally, run a separate kernel that reduces over this buffer and collects the final min and max values for the entire texture:

    kernel void min_max_reduce(constant uint *mapBuffer [[buffer(0)]],
                               device uint *reduceBuffer [[buffer(1)]],
                               uint2 tpig [[thread_position_in_grid]])
    {
        uint minv = mapBuffer[tpig[0] * 2];
        uint maxv = mapBuffer[tpig[0] * 2 + 1];
        device atomic_uint *atomicBuffer = (device atomic_uint *)reduceBuffer;
        // Fold each threadgroup's pair into the single global [min, max] pair.
        atomic_uint_exchange_if_less_than(atomicBuffer, minv);
        atomic_uint_exchange_if_greater_than(atomicBuffer + 1, maxv);
    }
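Continuing the host-side sketch above, the reduce pass and the final readback might look like the following, assuming the number of threadgroups fits within a single threadgroup of the reduce pipeline (see the note below about multiple passes otherwise):

    // Continuing the sketch: fold the per-threadgroup results into one pair.
    // Assumes `threadgroupCount` fits in a single threadgroup of this pipeline.
    let reduceBuffer = device.makeBuffer(length: 2 * MemoryLayout<UInt32>.stride,
                                         options: .storageModeShared)!

    let reduceEncoder = commandBuffer.makeComputeCommandEncoder()!
    reduceEncoder.setComputePipelineState(minMaxReduceState)     // assumed pipeline name
    reduceEncoder.setBuffer(mapBuffer, offset: 0, index: 0)
    reduceEncoder.setBuffer(reduceBuffer, offset: 0, index: 1)
    reduceEncoder.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                                       threadsPerThreadgroup: MTLSize(width: threadgroupCount,
                                                                      height: 1, depth: 1))
    reduceEncoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    let result = reduceBuffer.contents().bindMemory(to: UInt32.self, capacity: 2)
    let overallMin = result[0]   // overall minimum
    let overallMax = result[1]   // overall maximum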

Of course, a single pass can only reduce across as many values as the device allows in one threadgroup (~256 here), so for larger textures you may need to perform the reduction in multiple passes, each of which shrinks the data to be reduced by a factor of the maximum thread execution width.

Disclaimer: this may not be the most efficient approach, but it does appear to be correct in my limited testing on the OS X implementation. It was marginally faster than a naive CPU implementation for a 256x256 texture on an Intel Iris Pro, but considerably slower on an Nvidia GT 750M (because of unanticipated data transfer costs).


Source: https://habr.com/ru/post/1247260/

