OpenCL global memory fetches

I'm thinking of redesigning my OpenCL GPU kernel to speed things up. The problem is that there is a lot of global memory access that does not coalesce, and the fetches really hurt performance. So I plan to copy as much of the global memory as possible into local memory, but I have to choose what to copy.

My question is: how much do many fetches of small chunks of memory hurt compared with fewer fetches of larger chunks?

+4
3 answers

You can use clGetDeviceInfo to find out the cache line size for a device (clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE). On many devices today, this value is typically 16 bytes.

Small reads can be nasty, but if you are reading from the same cache line, you should be fine. The short answer: you need to keep your "small chunks" close together in memory to keep them fast.

I have two kernels below that demonstrate two ways of accessing memory: vectorAddFoo and vectorAddBar. The third kernel, copySomeMemory(...), is specific to your question. Both vector kernels have their work items add a portion of the vectors being summed, but they use different memory access patterns. vectorAddFoo has each work item process a block of vector elements, starting at its computed position in the arrays and working forward through its share. vectorAddBar has each work item start at its gid and skip ahead gSize elements (= global size) before fetching and adding the next elements.

vectorAddBar will run faster because its reads and writes coalesce in memory. Every 4 float reads fall on the same cache line and take only one action from the memory controller to serve. After reading a[] and b[] this way, all four work items can do their addition and queue their writes to c[].

vectorAddFoo guarantees that its reads and writes do not share cache lines (except for very short vectors, ~totalElements < 5). Every read from a work item requires its own action from the memory controller. Unless the GPU happens to cache the next 3 floats in every case, this results in 4x the memory accesses.

__kernel void vectorAddFoo(__global const float *a,
                           __global const float *b,
                           __global float *c,
                           const int totalElements)
{
    int gid = get_global_id(0);
    int elementsPerWorkItem = totalElements / get_global_size(0);
    int start = elementsPerWorkItem * gid;
    for (int i = 0; i < elementsPerWorkItem; i++) {
        c[start + i] = a[start + i] + b[start + i];
    }
}

__kernel void vectorAddBar(__global const float *a,
                           __global const float *b,
                           __global float *c,
                           const int totalElements)
{
    int gid = get_global_id(0);
    int gSize = get_global_size(0);
    for (int i = gid; i < totalElements; i += gSize) {
        c[i] = a[i] + b[i];
    }
}

__kernel void copySomeMemory(__global const int *src,
                             const int count,
                             const int position)
{
    // copy 16 KB of integers to local memory, starting at 'position'
    int lid = get_local_id(0);
    int lSize = get_local_size(0);
    __local int dst[4096];
    for (int i = lid; i < 4096; i += lSize) {
        dst[i] = src[position + i];
    }
    barrier(CLK_LOCAL_MEM_FENCE); // make the local copy visible to all work items
    // use dst here...
}
+5

In general, fewer, larger fetches will be more efficient. I can't give you specific advice without seeing your code, but make sure work items access sequential chunks so the hardware can stream the transfer. Do any transposition or random memory access only after the data has been moved into local memory.

+1

I may not fully understand your question, but if you have many global memory accesses and the data is reused, use local memory.

Note: if the local work size is too small, the data copied into local memory serves too few work items and goes underused; if it is too large, you lose parallelism across work groups. So you need to find the best balance.

0

Source: https://habr.com/ru/post/1403960/
