Why is there CL_DEVICE_MAX_WORK_GROUP_SIZE?

I am trying to understand the architecture of OpenCL devices such as GPUs, and I do not understand why there is an explicit upper bound on the number of work items in a local work-group, that is, the constant CL_DEVICE_MAX_WORK_GROUP_SIZE.

It seems to me that the compiler should take care of this. For simplicity, suppose the kernel is one-dimensional with a work-group size of 500, the physical maximum is 100, and the kernel looks, for example, like this:

__kernel void test(__global float* input)
{
    int i = get_global_id(0);
    someCode(i);
    barrier(CLK_LOCAL_MEM_FENCE);
    moreCode(i);
    barrier(CLK_LOCAL_MEM_FENCE);
    finalCode(i);
}

then it could be converted automatically to run with a work-group size of 100 on this device:

__kernel void test(__global float* input)
{
    int i = get_global_id(0);
    someCode(5*i); someCode(5*i+1); someCode(5*i+2); someCode(5*i+3); someCode(5*i+4);
    barrier(CLK_LOCAL_MEM_FENCE);
    moreCode(5*i); moreCode(5*i+1); moreCode(5*i+2); moreCode(5*i+3); moreCode(5*i+4);
    barrier(CLK_LOCAL_MEM_FENCE);
    finalCode(5*i); finalCode(5*i+1); finalCode(5*i+2); finalCode(5*i+3); finalCode(5*i+4);
}

However, this does not seem to be done by default. Why not? Is there a way to automate this process (other than writing my own source-to-source precompiler for it)? Or is there an inherent problem that would make my method fail in some cases (and can you give me an example)?

2 answers

I think the origin of CL_DEVICE_MAX_WORK_GROUP_SIZE lies in the underlying hardware implementation.

Many threads run simultaneously on a compute unit, and each of them has to keep state (for calls, jumps, etc.). Most implementations use a stack for this, and if you look at the AMD Evergreen family, there is a hardware limit on the number of available stack entries (each stack entry has sub-entries). This, in effect, limits the number of threads that each compute unit can handle simultaneously.

As for the compiler, it could do this. It might work, but understand that it would mean recompiling the kernel. That is not always possible. I can imagine situations where developers dump the compiled kernel for each platform in binary form and ship it with their software, if only for "not so open" reasons.
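To make the conversion from the question concrete, here is a minimal host-side sketch in plain C that emulates it sequentially. Everything in it is an assumption for illustration: the name run_group_coarsened, the sample kernel shown in the comment, and the sizes 500 and 100. It is not how any particular OpenCL compiler works. It does show one cost of the approach: a private variable that is live across a barrier (t here) needs one copy per virtual work item, so each physical thread carries several times more state. Barriers inside loops or conditionals make the split into regions harder still.

```c
#include <assert.h>
#include <stddef.h>

enum {
    VIRTUAL_SIZE  = 500, /* work-group size the kernel was written for */
    PHYSICAL_SIZE = 100, /* hypothetical hardware limit */
    PER_THREAD    = VIRTUAL_SIZE / PHYSICAL_SIZE
};

/* Hypothetical kernel being emulated, per work item with local id lid:
 *     float t = in[lid] * 2.0f;
 *     scratch[lid] = t;
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     out[lid] = t + scratch[(lid + 1) % VIRTUAL_SIZE];
 */
void run_group_coarsened(const float *in, float *out)
{
    float scratch[VIRTUAL_SIZE];          /* stands in for __local memory */
    float t[PHYSICAL_SIZE][PER_THREAD];   /* private 't', one slot per virtual item */

    assert(in != NULL && out != NULL);

    /* Region before the barrier: each physical thread p loops over
     * the PER_THREAD virtual work items assigned to it. */
    for (size_t p = 0; p < PHYSICAL_SIZE; ++p) {
        for (size_t k = 0; k < PER_THREAD; ++k) {
            size_t lid = p * PER_THREAD + k;
            t[p][k] = in[lid] * 2.0f;
            scratch[lid] = t[p][k];
        }
    }

    /* Barrier: on real hardware all physical threads would wait here.
     * In this sequential emulation, finishing the loop above plays that role. */

    /* Region after the barrier: same loop structure, reusing the saved 't'. */
    for (size_t p = 0; p < PHYSICAL_SIZE; ++p) {
        for (size_t k = 0; k < PER_THREAD; ++k) {
            size_t lid = p * PER_THREAD + k;
            out[lid] = t[p][k] + scratch[(lid + 1) % VIRTUAL_SIZE];
        }
    }
}
```

Calling run_group_coarsened with two 500-element arrays gives the same results the 500-item work-group would produce, at the price of five saved values of t per physical thread instead of one.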


These constants are queried from the device by the compiler to determine a suitable work-group size at compile time (where compiling, of course, refers to compiling the kernel). I may be misreading you, but it seems you are thinking of setting these values yourself, which is not the case.

The responsibility lies with your code to query the system's capabilities so that it is prepared for whatever hardware it will run on.
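As a rough illustration of that responsibility, the sketch below shows the arithmetic host code typically does once it knows the device limit. The helper names pick_local_size and padded_global_size are hypothetical. The value of device_max is assumed to come from clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE, which returns a size_t; the per-kernel limit from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE can be smaller and could be passed in instead. The OpenCL calls themselves are left out so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp the work-group size the program would like to use to the
 * limit reported by the device (hypothetical helper). */
size_t pick_local_size(size_t preferred, size_t device_max)
{
    assert(preferred > 0 && device_max > 0);
    return preferred < device_max ? preferred : device_max;
}

/* Round the number of work items up to a multiple of the local size.
 * OpenCL 1.x requires the global size to be evenly divisible by the
 * local size, so the kernel must ignore the padding items itself
 * (for example with an "if (i >= n) return;" guard). */
size_t padded_global_size(size_t work_items, size_t local_size)
{
    assert(local_size > 0);
    return (work_items + local_size - 1) / local_size * local_size;
}
```

With a preferred size of 500 and a device limit of 100, pick_local_size returns 100, and padded_global_size(1001, 100) returns 1100; those two values would then be passed as the local and global sizes to clEnqueueNDRangeKernel.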


Source: https://habr.com/ru/post/907165/

