CUB (CUDA UnBound) equivalent of thrust::gather

Question


Due to some performance issues with the Thrust library (see this page for more information), I plan to refactor a CUDA application to use CUB instead of Thrust, in particular to replace the calls to thrust::sort_by_key and thrust::inclusive_scan. In my application, I need to sort 3 arrays by key. Here's how I did it with Thrust:

 thrust::sort_by_key(key_iter, key_iter + numKeys, indices);
 thrust::gather(indices, indices + numKeys,
     thrust::make_zip_iterator(thrust::make_tuple(values1Ptr, values2Ptr, values3Ptr)),
     thrust::make_zip_iterator(thrust::make_tuple(valuesOut1Ptr, valuesOut2Ptr, valuesOut3Ptr)));

Where

key_iter is a thrust::device_ptr that points to the keys I want to sort by,
indices points to a sequence (0 to numKeys-1) in device memory,
values{1,2,3}Ptr are device_ptrs to the values I want to sort,
valuesOut{1,2,3}Ptr are device_ptrs to the sorted values.

With the CUB SortPairs function I can sort a single value buffer, but not all 3 in one shot. The problem is that I don't see any gather-like utilities in CUB. Suggestions?

EDIT:

I suppose I could implement my own gather kernel, but is there a better way to do this than:

 template <typename Index, typename Value>
 __global__ void gather_kernel(const unsigned int N, const Index * map,
                               const Value * src, Value * dst)
 {
     unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
     if (i < N)
     {
         dst[i] = src[map[i]];
     }
 }

The non-coalesced loads and stores make me uneasy, but they are probably unavoidable without a known structure in the map.


c++ parallel-processing gpu cuda thrust

rmccabe3701 Oct 6 '13 at 15:43


1 answer

kangshiyin · Accepted Answer · 2013-10-06T16:08:28+0000

It seems that what you want to achieve depends on thrust::zip_iterator. You can either

replace only thrust::sort_by_key with cub::DeviceRadixSort::SortPairs and keep thrust::gather, or
zip values{1,2,3} into an array of structures before using cub::DeviceRadixSort::SortPairs
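
A minimal sketch of the first option, with thrust::gather swapped for a hand-written kernel so that no Thrust call remains. The names gather3_kernel and sort_three_by_key are hypothetical, the key type (int), index type (unsigned int) and value type (float) are assumptions, and d_idx_in is assumed to already hold the sequence 0..n-1 in device memory, as in the question. Error checking is omitted for brevity.

```cuda
#include <cstddef>
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hypothetical fused gather: each thread reads the permutation entry once
// and reuses it for all three value arrays.
template <typename Index, typename Value>
__global__ void gather3_kernel(unsigned int n, const Index* map,
                               const Value* in1, const Value* in2, const Value* in3,
                               Value* out1, Value* out2, Value* out3)
{
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        Index j = map[i];   // coalesced read of the permutation
        out1[i] = in1[j];   // scattered reads, coalesced writes
        out2[i] = in2[j];
        out3[i] = in3[j];
    }
}

// Sorts the keys while carrying the indices 0..n-1 as the payload, then
// applies the resulting permutation to the three value arrays.
// All pointers are device pointers; input and output buffers must not alias.
void sort_three_by_key(const int* d_keys_in, int* d_keys_out,
                       const unsigned int* d_idx_in, unsigned int* d_idx_out,
                       const float* d_v1, const float* d_v2, const float* d_v3,
                       float* d_o1, float* d_o2, float* d_o3, int n)
{
    if (n <= 0) return;  // an empty grid launch would be an error

    // First call only queries the temporary storage size.
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out, d_idx_in, d_idx_out, n);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call performs the sort.
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out, d_idx_in, d_idx_out, n);
    cudaFree(d_temp);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    gather3_kernel<<<blocks, threads>>>(static_cast<unsigned int>(n), d_idx_out,
                                        d_v1, d_v2, d_v3, d_o1, d_o2, d_o3);
}
```

Fusing the three gathers into one kernel means the permutation is read once instead of three times; the reads from the value arrays remain scattered either way.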

Update

After reading the implementation of thrust::gather,

 $CUDA_HOME/include/thrust/system/detail/generic/gather.inl

you can see that it is only a naive kernel like

 __global__ void gather(int* index, float* in, float* out, int len)
 {
     int i = ...;  // global thread index
     if (i < len)
     {
         out[i] = in[index[i]];
     }
 }

So I think that your code above can be replaced by a single kernel without too much effort.

In this kernel, you could first use the CUB block-wide primitive cub::BlockRadixSort<...>::SortBlockedToStriped to get the sorted indices stored in registers, and then perform a naive reorder copy, as thrust::gather does, to fill values{1,2,3}Out.

Using SortBlockedToStriped rather than Sort allows coalesced writes (but not reads) when copying the values.
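
A rough sketch of such a single kernel, under a strong assumption: the whole input fits in one tile of BLOCK_THREADS * ITEMS elements and is processed by one thread block, since a block-wide sort cannot order data across blocks. The kernel name block_sort_gather3 and the int/float types are hypothetical choices for illustration.

```cuda
#include <climits>
#include <cub/cub.cuh>

// Launch with exactly one block of BLOCK_THREADS threads;
// requires n <= BLOCK_THREADS * ITEMS.
template <int BLOCK_THREADS, int ITEMS>
__global__ void block_sort_gather3(const int* keys_in, int* keys_out,
                                   const float* in1, const float* in2, const float* in3,
                                   float* out1, float* out2, float* out3, int n)
{
    // Key-value block sort: keys are int, the payload is the original index.
    using BlockSort = cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS, int>;
    __shared__ typename BlockSort::TempStorage temp;

    int keys[ITEMS];
    int idx[ITEMS];

    // Blocked arrangement: thread t owns items t*ITEMS .. t*ITEMS + ITEMS - 1.
    for (int k = 0; k < ITEMS; ++k) {
        int i   = threadIdx.x * ITEMS + k;
        idx[k]  = i;
        // Pad the tail with the largest key so padding sorts to the end
        // (the radix sort is stable, so padding stays behind real INT_MAX keys).
        keys[k] = (i < n) ? keys_in[i] : INT_MAX;
    }

    // After this call the data is in striped arrangement:
    // thread t owns sorted items t, t + BLOCK_THREADS, t + 2*BLOCK_THREADS, ...
    BlockSort(temp).SortBlockedToStriped(keys, idx);

    for (int k = 0; k < ITEMS; ++k) {
        int i = k * BLOCK_THREADS + threadIdx.x;  // consecutive threads, consecutive i
        if (i < n) {
            int j = idx[k];      // sorted permutation, held in registers
            keys_out[i] = keys[k];
            out1[i] = in1[j];    // scattered reads, coalesced writes
            out2[i] = in2[j];
            out3[i] = in3[j];
        }
    }
}
```

A launch would look like block_sort_gather3<128, 4><<<1, 128>>>(...) for up to 512 elements. For larger inputs this per-block approach no longer applies on its own, and a device-wide sort followed by a gather is the simpler route.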
