If it is not too large, try defining your filter globally inside the .cl file.
There you can try declaring it in either the __constant or the __local address space and compare which one is faster. Not all SDKs support global variables in the __local address space, though (I'm looking at you, ATI).
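As a rough illustration of the first suggestion, a filter defined at program scope in the .cl file might look like the sketch below. The filter size, the coefficients, the kernel name `convolve_1d` and its parameters are all made up for the example and are not taken from the question. `__constant` is used because that is the address space the OpenCL 1.x specification allows for program-scope variables.

```c
// filter.cl: a minimal sketch. FILTER_SIZE, the coefficients and the
// kernel signature are assumptions made for illustration only.
#define FILTER_SIZE 5

// Program-scope data in __constant space: read-only, initialized at
// declaration, and visible to every kernel in this file.
__constant float filter[FILTER_SIZE] = {
    0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f
};

__kernel void convolve_1d(__global const float *input,
                          __global float *output,
                          const int length)
{
    const int gid = (int)get_global_id(0);
    if (gid >= length)
        return;

    float sum = 0.0f;
    for (int k = 0; k < FILTER_SIZE; ++k) {
        // Clamp the sample index so border elements reuse the edge value.
        const int idx = max(0, min(gid + k - FILTER_SIZE / 2, length - 1));
        sum += filter[k] * input[idx];
    }
    output[gid] = sum;
}
```

With this layout the filter is no longer a kernel argument at all. A __local variant would usually declare the array inside the kernel, have the work-items copy the coefficients into it, and call barrier(CLK_LOCAL_MEM_FENCE) before reading, so that is the version to time against this one.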
If you still want to pass the filter as a kernel argument, consider calling SetKernelArg(0, ...) only once. You do not need to call SetKernelArg() 1000 times if neither the value nor the index of the kernel argument changes. This may not have a measurable effect on performance, but it is still cleaner.
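A sketch of what "only once" could look like on the host side, written against the plain C API (`clSetKernelArg`, which a wrapper's SetKernelArg presumably maps to). The helper name `run_filter_repeatedly` and its parameters are hypothetical. It assumes the queue, kernel and filter buffer already exist, that argument 0 is the filter, and that all other kernel arguments have been set elsewhere.

```c
#include <CL/cl.h>   /* on macOS the header is <OpenCL/opencl.h> */

/* Hypothetical helper: sets the filter argument once, then launches the
   kernel `iterations` times without touching that argument again. */
cl_int run_filter_repeatedly(cl_command_queue queue, cl_kernel kernel,
                             cl_mem filter_buf, size_t global_size,
                             int iterations)
{
    /* Argument values persist on the kernel object until they are changed,
       so this call does not need to be repeated inside the loop. */
    cl_int err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &filter_buf);
    if (err != CL_SUCCESS)
        return err;

    for (int i = 0; i < iterations; ++i) {
        /* Only arguments whose value really changes per launch would be
           set here. */
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     &global_size, NULL, 0, NULL, NULL);
        if (err != CL_SUCCESS)
            return err;
    }

    /* Wait for all queued launches to complete. */
    return clFinish(queue);
}
```

A call such as `run_filter_repeatedly(queue, kernel, filter_buf, n, 1000)` then issues 1000 launches with a single argument update.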