The prototype __ballot is the following
unsigned int __ballot(int predicate);
If the predicate nonzero, __ballot returns the value using a set of N th bits, where N is the stream index.
In combination with atomicOr and __popc it can be used to accumulate the number of threads in each warp that has a true predicate.
Indeed, the prototype of atomicOr is
int atomicOr(int* address, int val);
and atomicOr reads the value pointed to by address , performs the bitwise OR operation with val and writes the value back to address and returns its old value as the return parameter.
On the other hand, __popc returns the number of bits specified with the 32 -bit parameter.
Accordingly, the instructions
volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK]; const u32 warp_sum = threadIdx.x >> 5; atomicOr(&warp_shared_ballot[warp_num],__ballot(data[tid]>threshold)); atomicAdd(&block_shared_accumulate,__popc(warp_shared_ballot[warp_num]));
can be used to count the number of threads for which the predicate is true.
See Shane Cook, CUDA Programming, Morgan Kaufmann for details.
source share