The quickest solution I can think of is to not use arrays in the first place: instead, use separate variables and go through an access function, as if they were an array. IIRC (at least for the AMD compiler, but I'm sure this is true for NVIDIA as well): arrays usually end up in memory, while scalars can be kept in registers. (My memory is a little hazy on this, though. Maybe I'm wrong!)
Even if you need a giant switch statement:
    uint4 arr0123, arr4567;  // eight LUT entries packed into two vectors

    uint getLUT(int x) {
        switch (x) {
        case 0: return arr0123.s0;
        case 1: return arr0123.s1;
        case 2: return arr0123.s2;
        case 3: return arr0123.s3;
        case 4: return arr4567.s0;
        case 5: return arr4567.s1;
        case 6: return arr4567.s2;
        case 7:
        default: return arr4567.s3;
        }
    }
... you can still come out ahead in performance compared to the __private array, since, assuming the arr variables all fit into registers, the lookup is purely ALU-bound. (Assuming you have enough spare registers for the arr variables, of course.)
Please note: some OpenCL devices do not even have private memory, and everything you declare as __private just goes into __global. On those, using register storage is an even bigger gain.
Of course, this LUT approach is likely to be slower to initialize, because you will need at least two separate memory reads to copy the LUT data out of global memory.
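For what it's worth, here is a rough sketch of that initialization, assuming the eight LUT entries sit contiguously in a global buffer. vload4 is the standard OpenCL built-in. The names getLUT8 and sumLookups are made up for this example, and so is the small host-side shim at the top, which is only there so the same code can be tried with an ordinary C compiler. Here the two vectors are passed by value instead of being shared variables.

```c
/* Host-side shim (an assumption for illustration only). An OpenCL compiler
   defines __OPENCL_VERSION__ and skips this part, using its built-in uint,
   uint4 and vload4 instead. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
typedef unsigned int uint;
typedef struct { uint s0, s1, s2, s3; } uint4;
#define __global
static uint4 vload4(size_t offset, const uint *p) {
    /* Same meaning as the built-in: read p[4*offset] .. p[4*offset + 3]. */
    uint4 v = { p[4 * offset], p[4 * offset + 1],
                p[4 * offset + 2], p[4 * offset + 3] };
    return v;
}
#endif

/* Hypothetical variant of getLUT that takes the two vectors by value. */
uint getLUT8(uint4 lo, uint4 hi, int x) {
    switch (x) {
    case 0: return lo.s0;
    case 1: return lo.s1;
    case 2: return lo.s2;
    case 3: return lo.s3;
    case 4: return hi.s0;
    case 5: return hi.s1;
    case 6: return hi.s2;
    case 7:
    default: return hi.s3;
    }
}

/* Hypothetical helper: load the LUT once with two vector reads, then do
   n lookups against the two local vectors only. */
uint sumLookups(__global const uint *lut, __global const int *idx, int n) {
    uint4 lo = vload4(0, lut);  /* lut[0..3] */
    uint4 hi = vload4(1, lut);  /* lut[4..7] */
    uint sum = 0;
    for (int i = 0; i < n; ++i)
        sum += getLUT8(lo, hi, idx[i]);
    return sum;
}
```

The point is that the two loads are paid once up front and every lookup after that touches only lo and hi. Whether those really stay in registers is up to the compiler and the register pressure in the rest of the kernel.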