Is Boost.Compute slower than a regular processor?

I just started playing with Boost.Compute. To see how much speed it can bring us, I wrote a simple program:

 #include <iostream>
 #include <vector>
 #include <algorithm>
 #include <boost/foreach.hpp>
 #include <boost/compute/core.hpp>
 #include <boost/compute/platform.hpp>
 #include <boost/compute/algorithm.hpp>
 #include <boost/compute/container/vector.hpp>
 #include <boost/compute/functional/math.hpp>
 #include <boost/compute/types/builtin.hpp>
 #include <boost/compute/function.hpp>
 #include <boost/chrono/include.hpp>

 namespace compute = boost::compute;

 int main()
 {
     // generate random data on the host
     std::vector<float> host_vector(16000);
     std::generate(host_vector.begin(), host_vector.end(), rand);

     BOOST_FOREACH (auto const& platform, compute::system::platforms())
     {
         std::cout << "====================" << platform.name() << "====================\n";

         BOOST_FOREACH (auto const& device, platform.devices())
         {
             std::cout << "device: " << device.name() << std::endl;

             compute::context context(device);
             compute::command_queue queue(context, device);
             compute::vector<float> device_vector(host_vector.size(), context);

             // copy data from the host to the device
             compute::copy(
                 host_vector.begin(), host_vector.end(), device_vector.begin(), queue
             );

             auto start = boost::chrono::high_resolution_clock::now();
             compute::transform(device_vector.begin(),
                                device_vector.end(),
                                device_vector.begin(),
                                compute::sqrt<float>(),
                                queue);

             auto ans = compute::accumulate(device_vector.begin(), device_vector.end(), 0, queue);

             auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(
                 boost::chrono::high_resolution_clock::now() - start);
             std::cout << "ans: " << ans << std::endl;
             std::cout << "time: " << duration.count() << " ms" << std::endl;
             std::cout << "-------------------\n";
         }
     }

     std::cout << "====================plain====================\n";
     auto start = boost::chrono::high_resolution_clock::now();
     std::transform(host_vector.begin(), host_vector.end(), host_vector.begin(),
                    [](float v){ return std::sqrt(v); });
     auto ans = std::accumulate(host_vector.begin(), host_vector.end(), 0);
     auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(
         boost::chrono::high_resolution_clock::now() - start);
     std::cout << "ans: " << ans << std::endl;
     std::cout << "time: " << duration.count() << " ms" << std::endl;
     return 0;
 }

And here is a sample output on my computer (win7 64-bit):

 ====================Intel(R) OpenCL====================
 device: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
 ans: 1931421
 time: 64 ms
 -------------------
 device: Intel(R) HD Graphics 4600
 ans: 1931421
 time: 64 ms
 -------------------
 ====================NVIDIA CUDA====================
 device: Quadro K600
 ans: 1931421
 time: 4 ms
 -------------------
 ====================plain====================
 ans: 1931421
 time: 0 ms

My question is: why is the simple (non-opencl) version faster?

+6
3 answers

As others have said, your kernel most likely does not contain enough computation to make running it on the GPU worthwhile for a single data set (you are dominated by the kernel compilation time and the transfer time to the GPU).

To get a more representative measurement, you should run the algorithm several times (and most likely discard the first run, since it will take far longer because it includes the time to compile and store the kernels).

In addition, instead of calling transform() and accumulate() as separate operations, you should use the transform_reduce() algorithm, which performs the transform and the reduction with a single kernel. The code would look like this:

 float ans = 0;
 compute::transform_reduce(
     device_vector.begin(),
     device_vector.end(),
     &ans,
     compute::sqrt<float>(),
     compute::plus<float>(),
     queue
 );
 std::cout << "ans: " << ans << std::endl;

You can also compile code that uses Boost.Compute with -DBOOST_COMPUTE_USE_OFFLINE_CACHE, which enables the offline kernel cache (this requires linking against boost_filesystem). Your kernels will then be stored on the file system and compiled only the first time you run your application (NVIDIA already does this by default on Linux).
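A compile command might look like this (a sketch for a typical Linux setup; the exact library names, and whether boost_system is needed, depend on your Boost version and distribution):

```shell
# Enable Boost.Compute's offline kernel cache at compile time.
# boost_filesystem must be linked in for the cache to work.
g++ -O2 -DBOOST_COMPUTE_USE_OFFLINE_CACHE main.cpp \
    -lboost_filesystem -lboost_system -lOpenCL -o bench
```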

+8

I see one possible reason for the big difference. Compare the CPU and GPU data flow:

 CPU                    GPU
                        copy data to GPU
                        set up compute code
 calculate sqrt         calculate sqrt
 sum                    sum
                        copy data from GPU

Given this, it would seem that the Intel chip is just not very good at general compute, while NVIDIA probably suffers from the extra data copying and the setup required to run the computation on the GPU.

You should try the same program with a much more complicated operation - sqrt and sum are too simple to overcome the extra overhead of using the GPU. You could try computing Mandelbrot points, for example.

In your example, moving the lambda into the accumulate would also be faster (one pass over memory versus two passes).

+2

You are getting poor results because you are not measuring time correctly.

An OpenCL device has its own time counters, which are unrelated to the host's counters. Every OpenCL task has 4 states whose timestamps can be queried (from the Khronos website):

  • CL_PROFILING_COMMAND_QUEUED - when the command identified by the event is enqueued in a command-queue by the host
  • CL_PROFILING_COMMAND_SUBMIT - when the command identified by the event that has been enqueued is submitted by the host to the device associated with the command-queue.
  • CL_PROFILING_COMMAND_START - when the command identified by the event starts execution on the device.
  • CL_PROFILING_COMMAND_END - when the command identified by the event has finished execution on the device.

Note that the timers are on the device side. So, to measure kernel and command-queue performance, you can query these timers. In your case, the last 2 timers are what you need.

In your code example, you measure host time, which includes data transfer time (as Skizz pointed out) plus all the time spent servicing the command queue.

So, to find out the actual kernel performance, you need to either get hold of the cl_event for your kernel (I'm not sure how to do this in boost::compute) and query that event for the performance counters, or make your kernel really huge and complicated to hide all the overhead.
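The query itself might look like the sketch below. It is untested (it needs a real OpenCL device to run), and the enqueue step is left as comments since the kernel itself is not shown; the key points are that the queue must be created with profiling enabled, and that Boost.Compute's enqueue_* calls return a compute::event whose underlying cl_event can be passed to plain clGetEventProfilingInfo:

```cpp
#include <iostream>
#include <boost/compute/core.hpp>

namespace compute = boost::compute;

int main() {
    compute::device device = compute::system::default_device();
    compute::context context(device);
    // Profiling must be enabled when the queue is created, otherwise
    // the device-side counters are not recorded.
    compute::command_queue queue(context, device,
                                 compute::command_queue::enable_profiling);

    // ... build a kernel and enqueue it; enqueue_* returns an event:
    // compute::event ev = queue.enqueue_1d_range_kernel(kernel, 0, n, 0);
    // queue.finish();

    // Query the device-side START/END timestamps (in nanoseconds):
    // cl_ulong t_start = 0, t_end = 0;
    // clGetEventProfilingInfo(ev.get(), CL_PROFILING_COMMAND_START,
    //                         sizeof(t_start), &t_start, 0);
    // clGetEventProfilingInfo(ev.get(), CL_PROFILING_COMMAND_END,
    //                         sizeof(t_end), &t_end, 0);
    // std::cout << "kernel time: " << (t_end - t_start) / 1e6 << " ms\n";
    return 0;
}
```

This measures only the time the command spent executing on the device, excluding queueing, submission, and transfer overhead.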

+1

Source: https://habr.com/ru/post/970993/

