I am trying to understand the architecture of OpenCL devices such as GPUs, and I do not see why there is an explicit bound on the number of work items in a local work-group, that is, the constant CL_DEVICE_MAX_WORK_GROUP_SIZE.
It seems to me that the compiler should take care of this. For example, suppose a kernel (one-dimensional for simplicity) is launched with a work-group size of 500 while the physical maximum is 100, and the kernel looks like this:
    __kernel void test(__global float* input) {
        int i = get_global_id(0);
        someCode(i);
        barrier(CLK_LOCAL_MEM_FENCE);
        moreCode(i);
        barrier(CLK_LOCAL_MEM_FENCE);
        finalCode(i);
    }
Then it could be converted automatically into a kernel that runs with a work-group size of 100:
    __kernel void test(__global float* input) {
        int i = get_global_id(0);
        /* each physical work item runs five logical work items per phase */
        someCode(5*i); someCode(5*i+1); someCode(5*i+2); someCode(5*i+3); someCode(5*i+4);
        barrier(CLK_LOCAL_MEM_FENCE);
        moreCode(5*i); moreCode(5*i+1); moreCode(5*i+2); moreCode(5*i+3); moreCode(5*i+4);
        barrier(CLK_LOCAL_MEM_FENCE);
        finalCode(5*i); finalCode(5*i+1); finalCode(5*i+2); finalCode(5*i+3); finalCode(5*i+4);
    }
However, this does not seem to be done by default. Why not? Is there a way to automate this conversion (apart from writing a precompiler for it myself)? Or is there an intrinsic problem that would make my method fail on some kernels, and if so, can you give me an example?