Running parallel tasks across multiple devices requires dynamic scheduling to get good efficiency, because you never know the exact performance of any device in advance: it depends on the current load (not only from your program, but from everything else running), and on the current clock rates (which can change significantly on most CPUs and GPUs depending on the power-saving profile or thermal load). On top of that, the actual performance may also depend on your input data.
Of course, you can write all the necessary code yourself, as the other answers suggest, but in my opinion that is a waste of time, and it is much better to use an existing solution. I recommend StarPU. I used StarPU in my OpenCL project and it worked very well. StarPU ships with examples showing how to write code that can efficiently use multiple GPUs and CPUs at the same time.
StarPU:
Traditional processors have reached architectural limits which heterogeneous multicore designs and hardware specialization (e.g. coprocessors, accelerators, ...) intend to address. However, exploiting such machines introduces numerous challenging issues at all levels, ranging from programming models and compilers to the design of scalable hardware solutions. The design of efficient runtime systems for these architectures is a critical issue. StarPU typically makes it much easier for high-performance libraries or compiler environments to exploit heterogeneous multicore machines, possibly equipped with GPGPUs or Cell processors: rather than handling low-level issues, programmers can concentrate on algorithmic concerns.
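To give an idea of what StarPU code looks like, here is a minimal sketch (modeled on StarPU's classic vector-scaling example, not taken from my project) that registers a buffer with StarPU and submits a task whose codelet has a CPU implementation; an OpenCL or CUDA implementation can be registered in the same codelet, and the StarPU scheduler then decides at runtime which device executes each task. The pkg-config package name and the STARPU_MAIN_RAM constant assume a StarPU 1.x installation, so adjust them to your version.

    /* Minimal StarPU sketch: scale a vector, letting StarPU's scheduler
     * decide at runtime which worker (CPU core, GPU, ...) runs each task.
     * Compile with e.g.: gcc scale.c $(pkg-config --cflags --libs starpu-1.3)
     * (the pkg-config package name depends on the installed StarPU version). */
    #include <starpu.h>
    #include <stdio.h>

    #define NX 1024

    /* CPU implementation of the codelet; an OpenCL or CUDA implementation
     * could be registered alongside it in the codelet below. */
    static void scal_cpu_func(void *buffers[], void *cl_arg)
    {
        float factor = *(float *)cl_arg;
        float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (unsigned i = 0; i < n; i++)
            v[i] *= factor;
    }

    static struct starpu_codelet scal_cl =
    {
        .cpu_funcs = { scal_cpu_func },
        /* .opencl_funcs = { scal_opencl_func },  <- add a GPU version here */
        .nbuffers = 1,
        .modes = { STARPU_RW },
    };

    int main(void)
    {
        float vector[NX];
        for (int i = 0; i < NX; i++) vector[i] = 1.0f;

        if (starpu_init(NULL) != 0) return 1;

        /* Hand the buffer over to StarPU so it can move it between devices. */
        starpu_data_handle_t handle;
        starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                    (uintptr_t)vector, NX, sizeof(float));

        float factor = 3.14f;
        struct starpu_task *task = starpu_task_create();
        task->cl = &scal_cl;
        task->handles[0] = handle;
        task->cl_arg = &factor;
        task->cl_arg_size = sizeof(factor);

        starpu_task_submit(task);          /* scheduled dynamically */
        starpu_task_wait_for_all();

        starpu_data_unregister(handle);    /* result is copied back to "vector" */
        starpu_shutdown();

        printf("vector[0] = %f\n", vector[0]);
        return 0;
    }

In a real application you would submit many such tasks over partitioned data; with several implementations per codelet, StarPU's performance-model-based schedulers distribute them across the CPU cores and GPUs according to the measured speed of each device.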
There is also the SkePU project, though I have not tried it myself:
SkePU:
SkePU is a skeleton programming framework for multicore CPUs and multi-GPU systems. It is a C++ template library with six data-parallel and one task-parallel skeletons, two container types, and support for execution on multi-GPU systems with both CUDA and OpenCL. Recently, support for hybrid execution, performance-aware dynamic scheduling and load balancing was added to SkePU by implementing a backend for the StarPU runtime system.
If you search for "gpu cpu opencl dynamic scheduling", you can find even more useful free or commercial projects and documentation.