Is there an overhead for parallel_for (Intel TBB) similar to the overhead we see with std::function?

Question

There is a good discussion of std::function overhead in the linked question "std::function vs template". Basically, to avoid the roughly 10x overhead caused by heap allocation of the functor that you pass to the std::function constructor, you should use std::ref or std::cref.

An example taken from @CassioNeri's answer shows how to pass a lambda to std::function by reference.

float foo(std::function<float(float)> f) { return -1.0f * f(3.3f) + 666.0f; }

// std::cref does not accept temporaries, so the lambda needs a name.
auto g = [a, b, c](float arg) { return arg * 0.5f; };
foo(std::cref(g));
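To make the copying visible, here is a minimal sketch that counts how often a functor is copied when it is handed to `foo` directly versus through `std::cref`. The type `Half` and the two helper functions are hypothetical names introduced for illustration only. The sketch counts copies, not heap allocations: whether a given copy actually allocates depends on the standard library implementation and on the size of the functor.

```cpp
#include <functional>

// Hypothetical functor (not from the original post) that counts its copies.
struct Half {
    static int copies;
    Half() = default;
    Half(const Half&) { ++copies; }
    float operator()(float arg) const { return arg * 0.5f; }
};
int Half::copies = 0;

// Same shape as foo in the question.
float foo(std::function<float(float)> f) { return -1.0f * f(3.3f) + 666.0f; }

// Passing the functor itself: std::function stores its own copy of it.
int copies_by_value() {
    Half h;
    Half::copies = 0;
    foo(h);
    return Half::copies;
}

// Passing std::cref(h): std::function stores a reference wrapper instead,
// so the Half object itself is never copied.
int copies_by_cref() {
    Half h;
    Half::copies = 0;
    foo(std::cref(h));
    return Half::copies;
}
```

Calling `copies_by_value()` should report at least one copy, while `copies_by_cref()` should report none. Note that `h` must outlive the `std::function` that refers to it.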

The Intel Threading Building Blocks library now provides the ability to evaluate loops in parallel using lambdas/functors, as shown in the example below.

Code example:

#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/tbb_thread.h"
#include <algorithm>  // std::fill
#include <vector>

int main() {
    tbb::task_scheduler_init init(tbb::tbb_thread::hardware_concurrency());
    std::vector<double> a(1000);
    std::vector<double> c(1000);
    std::vector<double> b(1000);
    std::fill(b.begin(), b.end(), 1);
    std::fill(c.begin(), c.end(), 1);
    // The body receives a sub-range and processes it serially.
    auto f = [&](const tbb::blocked_range<size_t>& r) {
        for (size_t j = r.begin(); j != r.end(); ++j)
            a[j] = b[j] + c[j];
    };
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 1000), f);
    return 0;
}

So my question is: does Intel TBB parallel_for have the same overhead (heap allocation of the functor) that we see with std::function? Should I pass my lambdas/functors by reference to parallel_for using std::cref to speed up the code?

+4

c++ c++11 tbb

Vivian Miranda Sep 01 '13 at 1:02

2 answers

Should I pass my lambdas/functors by reference to parallel_for using std::cref to speed up the code?

I do not know the answer to your main question. But that doesn't matter, because you should never do this with tbb::parallel_for.

As Cassio Neri pointed out in his answer:

Finally, note that the lifetime of the lambda encloses that of the std::function.

This is true for the circumstances of the question he was answering. But it does not apply to tbb::parallel_for. The whole point of parallel_for is that it will call this function from other threads at an arbitrary time in the future.

If you give it a functor by reference, you must make sure that the functor's lifetime extends until parallel_for completes. Otherwise, parallel_for may try to call through a reference to a destroyed object.

This is bad.

Therefore, whatever the answer to the overhead question is, you cannot get away with references here.

+1

Nicol Bolas Sep 01 '13 at 1:13

Arch D. Robison · Accepted Answer · 2013-09-01T20:56:44+0000

Passing a functor using std::cref is likely to be counterproductive, but I make no promises. Only empirical testing in the exact context of interest can be conclusive. In general, for tbb::parallel_for, my recommendations are:

Pass the lambda by value.
Unless semantic considerations dictate the capture mode, have the lambda capture objects by reference, unless they are small objects that are cheap to copy. Keep in mind that the captured variables will typically be accessed much more often than the lambda is copied.
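The effect of the capture mode on the cost of copying a lambda can be sketched without TBB. In the hypothetical model below, `Tracked` and `copy_and_call` are invented names: `copy_and_call` only imitates a scheduler that copies the body once per task and runs the copies serially, so it is not how TBB itself is implemented.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical payload that counts how often it is copied.
struct Tracked {
    static int copies;
    std::vector<double> data;
    explicit Tracked(std::size_t n) : data(n, 1.0) {}
    Tracked(const Tracked& other) : data(other.data) { ++copies; }
};
int Tracked::copies = 0;

// Copies the body n times and calls each copy once, imitating a scheduler
// that gives every task its own copy of the body (serial, for illustration).
template <typename Body>
double copy_and_call(const Body& body, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        Body task_copy(body);  // one copy of the lambda per simulated task
        sum += task_copy(0);
    }
    return sum;
}

// Capture by reference: copying the lambda copies only a reference.
int copies_with_reference_capture() {
    Tracked t(1000);
    Tracked::copies = 0;
    auto body = [&t](std::size_t i) { return t.data[i]; };
    copy_and_call(body, 8);
    return Tracked::copies;
}

// Capture by value: every copy of the lambda copies the whole payload.
int copies_with_value_capture() {
    Tracked t(1000);
    auto body = [t](std::size_t i) { return t.data[i]; };
    Tracked::copies = 0;  // ignore the copy made when the closure was created
    copy_and_call(body, 8);
    return Tracked::copies;
}
```

In this model the reference-capturing lambda is copied eight times without touching the payload, while the value-capturing lambda duplicates the 1000-element vector on every copy.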

Does TBB pay a heap allocation cost for the functor? The answer is definitely no for the signature of the form parallel_for(first, last, functor), because this form passes the functor by reference.

For the signature of the form parallel_for(range, functor), as in the question, the answer is "no extra cost". There is no heap allocation made directly for the functor. But each task created by TBB holds a copy of the functor, and tasks are heap-allocated (usually quickly, via thread-local free lists). Using std::cref will not change the fact that tasks are heap-allocated. Using std::cref will merely add an extra level of indirection.
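The per-task copy described above can be imitated with a toy loop built on std::thread. This is a hedged sketch, not TBB's implementation: `toy_parallel_for`, `CountedBody`, and the helper functions are hypothetical names, the range is split into a fixed number of chunks (assumed to be greater than zero), and each worker thread plays the role of one task holding its own copy of the body.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a parallel loop; NOT TBB's implementation.
// Assumes chunks > 0. Each worker owns its own copy of `body`.
template <typename Body>
void toy_parallel_for(std::size_t first, std::size_t last,
                      std::size_t chunks, Body body) {
    std::vector<std::thread> workers;
    const std::size_t step = (last - first + chunks - 1) / chunks;
    for (std::size_t lo = first; lo < last; lo += step) {
        const std::size_t hi = (lo + step < last) ? lo + step : last;
        // `body` is captured by value: at least one copy per task.
        workers.emplace_back([body, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& w : workers) w.join();  // returns only after all tasks finish
}

// Hypothetical body that counts its copies and writes 2*i into out[i].
struct CountedBody {
    static std::atomic<int> copies;
    std::vector<double>* out;
    explicit CountedBody(std::vector<double>* o) : out(o) {}
    CountedBody(const CountedBody& other) : out(other.out) { ++copies; }
    void operator()(std::size_t i) const {
        (*out)[i] = 2.0 * static_cast<double>(i);
    }
};
std::atomic<int> CountedBody::copies{0};

// Body passed by value: each task ends up with its own CountedBody.
int copies_when_passed_by_value(std::vector<double>& out) {
    CountedBody body(&out);
    CountedBody::copies = 0;
    toy_parallel_for(0, out.size(), 4, body);
    return CountedBody::copies;
}

// Body wrapped in std::cref: tasks copy the wrapper, and every call
// goes through one extra indirection to the original object.
int copies_when_passed_by_cref(std::vector<double>& out) {
    CountedBody body(&out);
    CountedBody::copies = 0;
    toy_parallel_for(0, out.size(), 4, std::cref(body));
    return CountedBody::copies;
}
```

In this toy model, `std::cref` removes the copies of `CountedBody`, but the tasks (here, threads and their closures) are still created either way, and each invocation now dereferences the wrapper. Because `toy_parallel_for` joins all workers before returning, the referenced body only needs to live for the duration of the call.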

In fact, I was a little surprised that one of the forms of tbb::parallel_for passes the functor by reference and the other by value. I have forgotten the reason, and I'm sure the TBB group must have discussed this. The choice may have been motivated by which benchmarks and machines were available at the time each form was introduced, or perhaps by a PPL compatibility concern for the "first, last" form, which does not appear to require the functor to be copyable. As noted earlier, the performance trade-off between passing by reference and passing by value is not simple. Passing by reference makes passing the functor cheap, but adds the cost of an indirection to each access (if the compiler cannot optimize it away).

As for the lifetime of the functor argument, it only has to exist for the duration of the parallel_for call.
