To give a “quantitative” follow-up to Robert’s answer, let's look at Mark Harris's reduction approach using the CUDA shuffle operations detailed in Faster Parallel Reductions on Kepler.
In this approach, the warp-level reduction is done with __shfl_down. An alternative approach to warp reduction uses __shfl_xor, according to Lecture 4: warp shuffles, and reduction / scan operations. Below I report the complete code that implements both approaches. Tested on a Kepler K20c, both take 0.044ms to reduce an array of N=200000 float elements. Accordingly, both approaches outperform Thrust reduce by a factor of about 24 (more than an order of magnitude), since the execution time for the Thrust case is 1.06ms for the same test.
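To make the difference between the two shuffle patterns concrete, here is a minimal sketch of both warp reductions. It is not the benchmarked listing: the names warpReduceDown, warpReduceXor, sumKernel and reduceOnGpu, the 64x256 launch configuration, and the one-atomicAdd-per-warp final step are my own assumptions for illustration. It also assumes CUDA 9 or later, so it uses the __shfl_down_sync / __shfl_xor_sync variants with a full-warp mask; on the Kepler-era toolkit the unsuffixed __shfl_down / __shfl_xor named above play the same role.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define FULL_MASK 0xffffffffu  // all 32 lanes of the warp take part

// Shuffle-down pattern: each step adds the value held "offset" lanes higher.
// After the loop only lane 0 is guaranteed to hold the warp total.
__inline__ __device__ float warpReduceDown(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(FULL_MASK, val, offset);
    return val;
}

// Butterfly (XOR) pattern: lanes exchange with the partner lane (lane ^ mask).
// After the loop every lane holds the warp total.
__inline__ __device__ float warpReduceXor(float val) {
    for (int mask = warpSize / 2; mask > 0; mask /= 2)
        val += __shfl_xor_sync(FULL_MASK, val, mask);
    return val;
}

// Hypothetical demo kernel: grid-stride partial sums, one warp reduction,
// then a single atomicAdd per warp (a simplification chosen for brevity).
__global__ void sumKernel(const float* in, float* out, int n, bool useXor) {
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];
    sum = useXor ? warpReduceXor(sum) : warpReduceDown(sum);
    if ((threadIdx.x & (warpSize - 1)) == 0)  // lane 0 of each warp
        atomicAdd(out, sum);
}

// Host helper (hypothetical): copies the data over, launches, returns the sum.
// Error checking is omitted to keep the sketch short.
float reduceOnGpu(const std::vector<float>& h, bool useXor) {
    int n = static_cast<int>(h.size());
    float *dIn = nullptr, *dOut = nullptr, result = 0.0f;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, sizeof(float));
    sumKernel<<<64, 256>>>(dIn, dOut, n, useXor);  // block size is a multiple of 32
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}

int main() {
    // Same N as in the text; all ones, so the float sum is exact.
    std::vector<float> h(200000, 1.0f);
    float down = reduceOnGpu(h, false);
    float bfly = reduceOnGpu(h, true);
    printf("shfl_down: %.1f  shfl_xor: %.1f\n", down, bfly);
    return (down == 200000.0f && bfly == 200000.0f) ? 0 : 1;
}
```

The practical difference is where the result lands: with the shuffle-down pattern only lane 0 holds the total, while with the XOR butterfly every lane does. Both need the same number of shuffle steps per warp, which fits with the two variants showing the same timing in the test above.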
Here is the complete code:
#include <thrust/device_vector.h>