About the warp voting feature

In the CUDA programming guide, the concept of the voice voting function, "_all", "_any" and "__ballot" was introduced.

My question is: which applications will use these 3 functions?

+6
source share
3 answers

__ballot used in the CUDA-histogram and in the CUDA NPP library to quickly generate bitmaxes and combine it with __popc internal to make a very efficient implementation of Boolean reductions.

__all and __any used in abbreviations before the introduction of __ballot , although I can’t think of other uses for them.

+4
source

The prototype __ballot is the following

 unsigned int __ballot(int predicate); 

If the predicate nonzero, __ballot returns the value using a set of N th bits, where N is the stream index.

In combination with atomicOr and __popc it can be used to accumulate the number of threads in each warp that has a true predicate.

Indeed, the prototype of atomicOr is

 int atomicOr(int* address, int val); 

and atomicOr reads the value pointed to by address , performs the bitwise OR operation with val and writes the value back to address and returns its old value as the return parameter.

On the other hand, __popc returns the number of bits specified with the 32 -bit parameter.

Accordingly, the instructions

 volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK]; const u32 warp_sum = threadIdx.x >> 5; atomicOr(&warp_shared_ballot[warp_num],__ballot(data[tid]>threshold)); atomicAdd(&block_shared_accumulate,__popc(warp_shared_ballot[warp_num])); 

can be used to count the number of threads for which the predicate is true.

See Shane Cook, CUDA Programming, Morgan Kaufmann for details.

+1
source

As an example of an algorithm using the __ballot API, I would mention In-Kernel DM stream compaction Hughes et al. It is used in the prefix total portion of the stream compaction to count (for warp) the number of elements that pass the predicate.

Here is an article. In-k Stream Compaction

0
source

Source: https://habr.com/ru/post/915528/


All Articles