The prototype __ballot
is the following
unsigned int __ballot(int predicate);
If the predicate
nonzero, __ballot
returns the value using a set of N
th bits, where N
is the stream index.
In combination with atomicOr
and __popc
it can be used to accumulate the number of threads in each warp that has a true predicate.
Indeed, the prototype of atomicOr
is
int atomicOr(int* address, int val);
and atomicOr
reads the value pointed to by address
, performs the bitwise OR
operation with val
and writes the value back to address
and returns its old value as the return parameter.
On the other hand, __popc
returns the number of bits specified with the 32
-bit parameter.
Accordingly, the instructions
volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK]; const u32 warp_sum = threadIdx.x >> 5; atomicOr(&warp_shared_ballot[warp_num],__ballot(data[tid]>threshold)); atomicAdd(&block_shared_accumulate,__popc(warp_shared_ballot[warp_num]));
can be used to count the number of threads for which the predicate is true.
See Shane Cook, CUDA Programming, Morgan Kaufmann for details.
source share