The functionality you are looking for is called stream compaction.
You probably do need to provide an array with room for 4 solutions per thread. Trying to store the results directly in compact form is likely to create so many dependencies between threads that whatever you gain by copying less data back to the host is lost to a longer kernel run time.
The one exception is when almost all of the threads find no solutions. In that case, you can use an atomic operation to maintain an index into the output array. I think it would be safe to use atomicAdd() for this. Before storing a result, the thread calls atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread stores its result using that old value as the index.
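As a minimal sketch of the atomicAdd() approach, assuming the results are plain ints: the predicate is_solution(), the kernel name find_solutions, and the capacity of 4096 are all made up for illustration and are not from the question. Note that the order of the stored results is not deterministic, and the counter can exceed the capacity, so the host clamps it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical predicate: stands in for whatever test decides that a
// thread has found a solution. Here it is rare on purpose.
__device__ bool is_solution(int candidate) { return candidate % 1000 == 0; }

__global__ void find_solutions(int n, int *results, unsigned int *count,
                               unsigned int capacity)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (is_solution(i)) {
        // atomicAdd returns the value held before the increment, so each
        // thread that gets here receives a distinct slot.
        unsigned int slot = atomicAdd(count, 1u);
        if (slot < capacity)   // drop results that do not fit
            results[slot] = i;
    }
}

int main()
{
    const int n = 1 << 20;
    const unsigned int capacity = 4096;   // assumed upper bound on results

    int *d_results = nullptr;
    unsigned int *d_count = nullptr;
    cudaMalloc(&d_results, capacity * sizeof(int));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    find_solutions<<<(n + 255) / 256, 256>>>(n, d_results, d_count, capacity);

    // Copy back the counter first, then only as many results as were stored.
    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(h_count), cudaMemcpyDeviceToHost);
    if (h_count > capacity) h_count = capacity;

    std::vector<int> h_results(h_count);
    if (h_count > 0)
        cudaMemcpy(h_results.data(), d_results, h_count * sizeof(int),
                   cudaMemcpyDeviceToHost);

    printf("found %u solutions\n", h_count);

    cudaFree(d_results);
    cudaFree(d_count);
    return 0;
}
```

Every successful thread contends for the same counter, which is why this only pays off when solutions are rare.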
However, in the more common situation where there is a fair number of results, the best solution is to perform the compaction as a separate step. One way to do this is with thrust::copy_if. See this question for more background.
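A rough sketch of the separate compaction step with thrust::copy_if might look like the following. It assumes each thread owns 4 int slots and that unused slots hold a sentinel value of -1; the sentinel, the search kernel, and the is_valid functor are illustrative assumptions, not part of the original question.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>

#define SLOTS_PER_THREAD 4
#define EMPTY (-1)   // assumed sentinel for "no solution in this slot"

// Predicate for copy_if: keep every slot that is not the sentinel.
struct is_valid {
    __host__ __device__ bool operator()(int x) const { return x != EMPTY; }
};

// Hypothetical search kernel: each thread writes into its own 4 slots, so
// there is no contention between threads. Most slots stay empty.
__global__ void search(int n, int *slots)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int *mine = slots + i * SLOTS_PER_THREAD;
    for (int k = 0; k < SLOTS_PER_THREAD; ++k) mine[k] = EMPTY;

    if (i % 1000 == 0) mine[0] = i;   // stand-in for a real solution
}

int main()
{
    const int n = 1 << 16;

    thrust::device_vector<int> slots(n * SLOTS_PER_THREAD);
    search<<<(n + 255) / 256, 256>>>(n, thrust::raw_pointer_cast(slots.data()));
    cudaDeviceSynchronize();

    // Compaction as a separate step: copy only the valid slots into a
    // dense array. copy_if returns an iterator one past the last element
    // written, and it preserves the input order.
    thrust::device_vector<int> compact(slots.size());
    thrust::device_vector<int>::iterator end =
        thrust::copy_if(slots.begin(), slots.end(), compact.begin(), is_valid());
    compact.resize(end - compact.begin());

    // Only the compacted results cross the bus back to the host.
    thrust::host_vector<int> h_results = compact;
    printf("kept %d of %d slots\n", (int)h_results.size(), (int)slots.size());
    return 0;
}
```

Unlike the atomic version, the output here keeps the same relative order as the per-thread slots.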