You can use clGetDeviceInfo to query the cache line size of the device (clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE). On many devices today, this value is typically 16 bytes.
Small reads can be troublesome, but if you are reading from the same cache line, you should be fine. The short answer is: keep your "small chunks" close together in memory so that access to them stays fast.
I have two functions below to demonstrate two ways of accessing memory: vectorAddFoo and vectorAddBar. The third function, copySomeMemory(...), applies specifically to your question. In both vector functions, each work item adds a portion of the vectors, but they use different memory access patterns. vectorAddFoo gives each work item a contiguous block of vector elements to process, starting at its calculated position in the arrays and moving forward through its workload. vectorAddBar has each work item start at its gid and skip ahead by gSize elements (= global size) before fetching and adding the next elements.
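To make the two index patterns concrete, here is a minimal host-side sketch in plain C. The helper names foo_index and bar_index are mine, not part of OpenCL; they only reproduce the index arithmetic that each kernel performs for a given work item and loop iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: element visited by work item `gid` on its i-th loop
   iteration in the blocked layout used by vectorAddFoo. */
size_t foo_index(size_t gid, size_t i, size_t global_size, size_t total_elements)
{
    assert(global_size > 0 && gid < global_size);
    size_t per_item = total_elements / global_size; /* same integer division as the kernel */
    assert(i < per_item);
    return per_item * gid + i;
}

/* Hypothetical helper: element visited by work item `gid` on its i-th loop
   iteration in the strided layout used by vectorAddBar. */
size_t bar_index(size_t gid, size_t i, size_t global_size)
{
    assert(global_size > 0 && gid < global_size);
    return gid + i * global_size;
}
```

With 4 work items and 16 elements, work item 1 visits 4, 5, 6, 7 under foo_index and 1, 5, 9, 13 under bar_index. On any single iteration, the four work items are 4 elements apart in the first case and adjacent in the second.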
vectorAddBar will run faster because the reads and writes of neighboring work items fall on the same cache line. Every 4 float reads fall on the same cache line (4 floats of 4 bytes each fill a 16-byte line) and need only one action from the memory controller. After a[] and b[] have been read, all four work items in question can do their addition and queue their writes to c[].
vectorAddFoo guarantees that the reads and writes of neighboring work items do not fall on the same cache line (except for very short vectors, roughly totalElements < 5). Each read from a work item requires an action from the memory controller. Unless the GPU happens to cache the following 3 floats in each case, this results in 4x the memory accesses.
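The 1-versus-4 difference can be checked with a small counting sketch in plain C. It assumes the buffer starts on a cache line boundary, that a float is 4 bytes, and that a group of adjacent work items issues its reads together; lines_touched is a hypothetical helper of mine, and real hardware may group and cache accesses differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of distinct cache lines touched when `group`
   adjacent work items each read one float in the same step.
   first_index: element read by the first work item of the group.
   stride:      distance in elements between neighboring work items
                (1 for vectorAddBar, elementsPerWorkItem for vectorAddFoo).
   Assumes the buffer itself begins on a cache line boundary. */
size_t lines_touched(size_t first_index, size_t stride, size_t group, size_t line_bytes)
{
    assert(group > 0 && line_bytes >= sizeof(float));
    size_t floats_per_line = line_bytes / sizeof(float);
    size_t count = 0;
    size_t prev_line = 0;
    for (size_t g = 0; g < group; g++) {
        /* Indices only grow with g, so a change of line means a new line. */
        size_t line = (first_index + g * stride) / floats_per_line;
        if (g == 0 || line != prev_line) {
            count++;
            prev_line = line;
        }
    }
    return count;
}
```

For a 16-byte line, lines_touched(0, 1, 4, 16) gives 1 (the vectorAddBar pattern), while lines_touched(0, 256, 4, 16) gives 4 (the vectorAddFoo pattern with 256 elements per work item). A misaligned start such as lines_touched(2, 1, 4, 16) gives 2, which is why it helps when the global size is a multiple of the number of floats per line.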
// Blocked access: each work item processes its own contiguous run of elements.
// Any remainder of totalElements / global size is not processed.
__kernel void vectorAddFoo(__global const float * a,
                           __global const float * b,
                           __global float * c,
                           const int totalElements)
{
    int gid = get_global_id(0);
    int elementsPerWorkItem = totalElements / get_global_size(0);
    int start = elementsPerWorkItem * gid;

    for (int i = 0; i < elementsPerWorkItem; i++) {
        c[start + i] = a[start + i] + b[start + i];
    }
}

// Strided access: neighboring work items touch neighboring elements on every iteration.
__kernel void vectorAddBar(__global const float * a,
                           __global const float * b,
                           __global float * c,
                           const int totalElements)
{
    int gid = get_global_id(0);
    int gSize = get_global_size(0);

    for (int i = gid; i < totalElements; i += gSize) {
        c[i] = a[i] + b[i];
    }
}

__kernel void copySomeMemory(__global const int * src, const int count, const int position) {