How to efficiently collect data from threads in CUDA?

I have an application that solves a system of equations in CUDA. I know for sure that each thread can find up to 4 solutions, but how can I copy them back to the host efficiently?

I am passing a huge array with enough space for every thread to store 4 solutions (4 doubles per solution), plus a second array holding the number of solutions per thread. This is a naive approach, though, and it is the current bottleneck of my kernel.

I would really like to optimize this. The main problem is concatenating a variable number of solutions per thread into a single array.

1 answer

The functionality you are looking for is called stream compaction.

You probably do need to provide an array with room for 4 solutions per thread, because trying to store the results directly in compact form is likely to create so many dependencies between threads that whatever you gain by copying less data back to the host is lost to a longer kernel execution time. The exception is when almost no threads find a solution. In that case you can use an atomic operation to maintain an index into the output array. I think atomicAdd() is safe to use for this: before storing a result, the thread calls atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread stores its result using that old value as the index.
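A minimal sketch of the atomicAdd() approach for the sparse case might look like the following. The Solution struct, the placeholder solve() function, the kernel name solveSparse and the host helper collectSparse are all hypothetical names made up for illustration; only the 4-doubles-per-solution layout comes from the question. Error checking is left out for brevity, and the order of the results in the output array is not deterministic.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical record: 4 doubles per solution, as described in the question.
struct Solution { double v[4]; };

// Placeholder solver (assumption, not the real one): only threads whose index
// is a multiple of 1000 "find" one solution, to mimic the sparse case.
__device__ int solve(int tid, Solution out[4]) {
    if (tid % 1000 != 0) return 0;
    for (int k = 0; k < 4; ++k) out[0].v[k] = tid + 0.25 * k;
    return 1;
}

__global__ void solveSparse(int n, Solution* results,
                            unsigned int* count, unsigned int capacity) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    Solution local[4];
    int found = solve(tid, local);
    for (int i = 0; i < found; ++i) {
        // Reserve one slot: atomicAdd returns the value before the increment.
        unsigned int slot = atomicAdd(count, 1u);
        // Drop the result if the output buffer is already full.
        if (slot < capacity) results[slot] = local[i];
    }
}

// Runs the kernel and copies back only the slots that were actually filled.
std::vector<Solution> collectSparse(int n, unsigned int capacity) {
    assert(n > 0 && capacity > 0);
    Solution* dResults = nullptr;
    unsigned int* dCount = nullptr;
    cudaMalloc(&dResults, capacity * sizeof(Solution));
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemset(dCount, 0, sizeof(unsigned int));

    int block = 256;
    int grid = (n + block - 1) / block;
    solveSparse<<<grid, block>>>(n, dResults, dCount, capacity);

    // This blocking copy also waits for the kernel to finish.
    unsigned int count = 0;
    cudaMemcpy(&count, dCount, sizeof(count), cudaMemcpyDeviceToHost);
    if (count > capacity) count = capacity;  // counter may overshoot the buffer

    std::vector<Solution> host(count);
    if (count > 0) {
        cudaMemcpy(host.data(), dResults, count * sizeof(Solution),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(dResults);
    cudaFree(dCount);
    return host;
}
```

Calling collectSparse(100000, 1024) from host code would return only the solutions that were found, so the transfer size scales with the number of results rather than with the number of threads. Every found solution contends for the same counter, which is why this is only attractive when results are rare.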

However, in the more common situation where there are a fair number of results, the best solution is to perform the compaction as a separate step. One way to do this is with thrust::copy_if. See this question for more details.
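The separate compaction step could be sketched roughly as below, using the stencil overload of thrust::copy_if. The Solution struct, the placeholder solve() function, the per-slot validity flags, the IsValid predicate and the helper collectCompacted are assumptions introduced for this example; the original kernel stores a count per thread instead of flags, so treat this as one possible layout rather than the only one.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>

// Hypothetical record: 4 doubles per solution, as described in the question.
struct Solution { double v[4]; };

// Placeholder solver (assumption): thread tid "finds" tid % 5 solutions and
// tags each one with its thread index and its position within the thread.
__device__ int solve(int tid, Solution out[4]) {
    int found = tid % 5;
    for (int i = 0; i < found; ++i) {
        out[i].v[0] = tid;
        out[i].v[1] = i;
        out[i].v[2] = 0.0;
        out[i].v[3] = 0.0;
    }
    return found;
}

// Each thread writes into its own 4 padded slots and marks which are valid,
// so there are no dependencies between threads inside the kernel.
__global__ void solvePadded(int n, Solution* padded, unsigned char* valid) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    Solution local[4];
    int found = solve(tid, local);
    for (int i = 0; i < 4; ++i) {
        valid[4 * tid + i] = (i < found) ? 1 : 0;
        if (i < found) padded[4 * tid + i] = local[i];
    }
}

// Predicate applied to the stencil (the validity flags).
struct IsValid {
    __host__ __device__ bool operator()(unsigned char flag) const {
        return flag != 0;
    }
};

thrust::host_vector<Solution> collectCompacted(int n) {
    thrust::device_vector<Solution> padded(4 * n);
    thrust::device_vector<unsigned char> valid(4 * n);
    thrust::device_vector<Solution> compact(4 * n);

    int block = 256;
    int grid = (n + block - 1) / block;
    solvePadded<<<grid, block>>>(n,
                                 thrust::raw_pointer_cast(padded.data()),
                                 thrust::raw_pointer_cast(valid.data()));

    // Keep only the slots whose flag is set; copy_if preserves input order.
    auto end = thrust::copy_if(padded.begin(), padded.end(), valid.begin(),
                               compact.begin(), IsValid());
    compact.resize(end - compact.begin());

    // Only the compacted solutions are transferred to the host.
    thrust::host_vector<Solution> host = compact;
    return host;
}
```

Because thrust::copy_if is stable, the compacted array keeps the solutions in thread order, which the atomic approach does not guarantee. The padded buffer never leaves the device, so the host transfer covers only valid solutions.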


Source: https://habr.com/ru/post/918712/

