How multi-threaded tail recursion using TBB

Question

How multi-threaded tail recursion using TBB

I am trying to use tbb for multithreading an existing recursive algorithm. The single-threaded version uses tail call recursion, structurally it looks something like this:

void my_func() { my_recusive_func (0); } bool doSomeWork (int i, int& a, int& b, int& c) { // do some work } void my_recusive_func (int i) { int a, b, c; bool notDone = doSomeWork (i, a, b, c); if (notDone) { my_recusive_func (a); my_recusive_func (b); my_recusive_func (c); } }

I am new to tbb, so my first attempt used the parallel_invoke function:

 void my_recusive_func (int i) { int a, b, c; bool notDone = doSomeWork (i, a, b, c); if (notDone) { tbb::parallel_invoke ( [a]{my_recusive_func (a);}, [b]{my_recusive_func (b);}, [c]{my_recusive_func (c);}); } }

This works, and it works faster than the single-threaded version, but it does not seem to scale with the number of cores. I am targeting 16 cores (32 hyperthreads), so scaling is very important for this project, but this version only speeds up work on this computer 8 times faster, and many cores seem to be inactive while the algorithm is running.

My theory is that tbb is waiting for child tasks to complete after parallel_invoke, so there can be many tasks that sit around waiting for a wait without extra wait? Will this explain idle cores? Is there a way to return the parental task without waiting for the children to appear? I thought maybe something like this, but I still don't know enough about the scheduler, but I know if everything is okay or not:

 void my_func() { tbb::task_group g; my_recusive_func (0, g); g.wait(); } void my_recusive_func (int i, tbb::task_group& g) { int a, b, c; bool notDone = doSomeWork (i, a, b, c); if (notDone) { g.run([a,&g]{my_recusive_func(a, g);}); g.run([b,&g]{my_recusive_func(b, g);}); my_recusive_func (c, g); } }

My first question is: tbb::task_group::run() thread safe? I could not figure it out from the documentation. Also, is there a better way to do this? Perhaps I should use low-level scheduler calls?

(I typed this code without compilation, so please forgive the typos.)

+6

c ++ multithreading c ++ 11 recursion tbb

atb May 24, '14 at 13:31

source share

3 answers

I am sure tbb::task_group::run() is thread safe. I can not find a mention in the documentation, which is pretty surprising.

but

This 2008 blog post contains a primitive implementation of task_group , whose run() method is clearly identified as thread safe. The current implementation is pretty similar.
The testing code for tbb::task_group (in src/test/test_task_group.cpp ) comes with a test designed to check the security of the task_group (it generates many threads, each of which calls run() thousand times or more on the same task_group object) .
The sudoku code example (in examples/task_group/sudoku/sudoku.cpp ) that comes with TBB also calls task_group::run from multiple threads in a recursive function, essentially the same as your suggested code does.
task_group is one of the functions shared between TBB and Microsoft PPL, whose task_group thread safe . Although the TBB documentation warns that the behavior may still be different between the TBB and PPL versions, it would be completely unexpected if there was something fundamental like thread safety (and therefore the need for external synchronization).
tbb::structured_task_group (described as "like a task_group , but has only a subset of functionality") has an explicit limitation that "The run , run_and_wait , cancel and wait methods should only be called by the thread that created structured_task_group ."

+2

TC May 24, '14 at 15:10

source share

You can also implement this as follows:

 constexpr int END = 10; constexpr int PARALLEL_LIMIT = END - 4; static void do_work(int i, int j) { printf("%d, %d\n", i, j); } static void serial_recusive_func(int i, int j) { // DO WORK HERE // ... do_work(i,j); if (i < END) { serial_recusive_func(i+1, 0); serial_recusive_func(i+1, 1); serial_recusive_func(i+1, 2); } } class RecursiveTask : public tbb::task { int i; int j; public: RecursiveTask(int i, int j) : tbb::task(), i(i), j(j) {} task *execute() override { //DO WORK HERE //... do_work(i,j); if (i >= END) return nullptr; if (i < PARALLEL_LIMIT) { auto &c = *new (allocate_continuation()) tbb::empty_task(); c.set_ref_count(3); spawn(*new(c.allocate_child()) RecursiveTask(i+1, 0)); spawn(*new(c.allocate_child()) RecursiveTask(i+1, 1)); recycle_as_child_of(c); i = i+1; j = 2; return this; } else { serial_recusive_func(i+1, 0); serial_recusive_func(i+1, 1); serial_recusive_func(i+1, 2); } return nullptr; } }; static void my_func() { tbb::task::spawn_root_and_wait( *new(tbb::task::allocate_root()) RecursiveTask(0, 0)); } int main() { my_func(); }

Your question did not have much information about the "do work here", so my implementation does not allow do_work to return a value or affect recursion. If you need this, you should edit your question to include a mention of what effect "work here" is expected to have for general calculation.

0

Mankarse May 24, '14 at 15:00

source share

Arch D. robison · Accepted Answer · 2014-05-27T15:20:33+0000

There are two questions here:

Is the TBB implementation a task_group :: run thread-safe task? Yes. (We must do this more clearly).
Are there many threads calling the run () method on the same scalable task_group? No. (I believe Microsoft's documentation mentions this somewhere.) The reason is that task_group is becoming a central point of contention. This is just a fetch and add to the implementation, but it is still ultimately impractical since the affected cache line should bounce.

As a rule, it is best to spawn a small number of tasks from task_group. If using recursive parallelism, give each level its own task_group. Although performance will most likely not be better than using parallel_invoke.

The best option is low-level tbb :: task interfaces. You can even encode tail recursion using a trick where tasK :: execute returns a pointer to the tail call task.

But I'm a little worried about idle threads. I am wondering if there is enough work to support the flow. Consider first an analysis of the workspace . If you use the Intel compiler (or gcc 4.9), you can try experimenting with the Cilk version first. If this does not accelerate, then even the low-level tbb :: task interface is unlikely to help, and higher-level problems (work and range) should be considered.

How multi-threaded tail recursion using TBB

More articles: