What is the best way to implement a small lookup table in an OpenCL kernel?

In my kernel I need to make a large number of random accesses to a small lookup table (just 8 32-bit integers in total). Each kernel instance has its own unique lookup table. Below is a simplified version of the kernel that illustrates how the lookup table is used.

__kernel void some_kernel(__global uint* global_table,
                          __global uint* X,
                          __global uint* Y,
                          const uint n)   // number of lookups per work-item
{
    size_t gsi = get_global_size(0);
    size_t gid = get_global_id(0);

    __private uint LUT[8];
    // ... 8 words of global_table are copied into LUT here (omitted) ...

    // Y is assigned a value from the lookup table based on the current value of X
    for (size_t i = 0; i < n; i++) {
        Y[i * gsi + gid] = LUT[X[i * gsi + gid]];
    }
}

Because of its small size, I get the best performance by storing the table in __private memory. However, due to the random pattern of accesses to the lookup table, a large performance hit is still being incurred. If I remove the lookup-table code (replacing it with, say, a simple arithmetic operation), the kernel gives an incorrect answer, but performance improves by more than 3x.

Is there a better way to handle this? Have I missed some OpenCL feature that provides efficient random access to very small chunks of memory? Could there be an efficient solution using vector types?

[edit] Note that the maximum value of X is 7, but the maximum value of Y is 2^32 - 1. In other words, all bits of the lookup-table entries are used, so the table cannot be packed into a smaller representation.

+6
2 answers

The fastest solution I can think of is to not use an array in the first place: instead, use separate variables and some accessor function to read them as if they were an array. IIRC (at least for the AMD compiler, but I'm fairly sure this holds for NVIDIA as well), arrays are usually always stored in memory, while scalars can be kept in registers. (My memory is a little hazy on this, though. I might be wrong!)

Even if that means you need a giant switch statement:

uint4 arr0123, arr4567;

uint getLUT(int x) {
    switch (x) {
        case 0: return arr0123.s0;
        case 1: return arr0123.s1;
        case 2: return arr0123.s2;
        case 3: return arr0123.s3;
        case 4: return arr4567.s0;
        case 5: return arr4567.s1;
        case 6: return arr4567.s2;
        case 7:
        default: return arr4567.s3;
    }
}

... you may still come out ahead of the __private array in performance, since the accesses, assuming the arr variables all fit into registers, are purely ALU-bound. (Assuming you have enough spare registers for the arr variables, of course.)
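Plugged into the original loop, the lookup would then read something like the following (a sketch only; getLUT and the arr variables are the ones defined above, and n, gsi, and gid come from the question's kernel):

    // Sketch: replace the private-array lookup with the register-backed accessor.
    for (size_t i = 0; i < n; i++) {
        Y[i * gsi + gid] = getLUT(X[i * gsi + gid]);
    }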

Note that some OpenCL devices don't even have private memory, and anything you declare as __private simply ends up in __global memory. Getting the data into register storage there is a big win.

Of course, this LUT layout will probably be slower to initialize, because you will need at least two separate memory reads to copy the LUT data out of global memory.
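A minimal sketch of that initialization, assuming each work-item's 8 table entries are stored contiguously in global_table (the actual layout isn't given in the question):

    // Two vector loads pull the 8-entry table into the two uint4 variables.
    // The indexing (gid * 8) is an assumed layout, not taken from the question.
    uint4 arr0123 = vload4(0, global_table + gid * 8);  // entries 0..3
    uint4 arr4567 = vload4(1, global_table + gid * 8);  // entries 4..7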

+4

As rtollert said, it is implementation-dependent whether LUT[] ends up in registers or in global memory. Arrays inside a kernel are usually a no-no, but since this one is so small it is hard to say where it will be placed. Assuming LUT[] does land in registers, I would say the reason it takes so long compared to a simple arithmetic operation is not that it is accessed randomly, but that each work item performs an extra 8 (Edit: seems like a lot more) global reads of X just to compute the LUT index. Depending on what was omitted from the simplified kernel, could you do something like Y[i*gsi+gid] = global_table[someIndex + X[i*gsi+gid]];?
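In kernel form that suggestion would look roughly like the following, with someIndex standing in as a placeholder for wherever this work-item's 8 entries start in global_table (it is not defined in the question):

    // Sketch: index global_table directly instead of copying into a private LUT.
    for (size_t i = 0; i < n; i++) {
        Y[i * gsi + gid] = global_table[someIndex + X[i * gsi + gid]];
    }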

+1
