I can think of two possibilities, one of which was already suggested by @JaredHoberock. I don't know of a general methodology to fuse for-loop iterations in thrust, but the second method is a more general approach. My guess is that the first method would be the faster of the two, in this case.
Use a vectorized sort. If the regions to be sorted by your nested for-loops are non-overlapping, you can do a vectorized sort using two back-to-back stable-sort operations, as discussed.
Thrust v1.8 (available with CUDA 7 RC, or via direct download from the Thrust github repository) supports nesting thrust algorithms, by including a thrust algorithm call within a custom functor passed to another thrust algorithm. If you use thrust::for_each to select the individual sorts you need to perform, you can run those sorts with a single thrust algorithm call, by including the thrust::sort operation in the functor you pass to thrust::for_each .
Here is a fully worked comparison of the three methods:
- original sort-in-a-loop method
- vectorized / batch sort
- nested sort
In each case, we sort the same 16,000 sets of 1,000 integers.
```
$ cat t617.cu
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/generate.h>
#include <thrust/equal.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <iostream>
#include <stdlib.h>

#define NSORTS 16000
#define DSIZE 1000

int my_mod_start = 0;
int my_mod(){
  return (my_mod_start++)/DSIZE;
}

bool validate(thrust::device_vector<int> &d1, thrust::device_vector<int> &d2){
  return thrust::equal(d1.begin(), d1.end(), d2.begin());
}

struct sort_functor
{
  thrust::device_ptr<int> data;
  int dsize;
  __host__ __device__
  void operator()(int start_idx)
  {
    thrust::sort(thrust::device, data+(dsize*start_idx), data+(dsize*(start_idx+1)));
  }
};

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int main(){
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, (16*DSIZE*NSORTS));
  thrust::host_vector<int> h_data(DSIZE*NSORTS);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  thrust::device_vector<int> d_data = h_data;

  // first the loop method
  thrust::device_vector<int> d_result1 = d_data;
  thrust::device_ptr<int> r1ptr = thrust::device_pointer_cast<int>(d_result1.data());
  unsigned long long mytime = dtime_usec(0);
  for (int i = 0; i < NSORTS; i++)
    thrust::sort(r1ptr+(i*DSIZE), r1ptr+((i+1)*DSIZE));
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "loop time: " << mytime/(float)USECPSEC << "s" << std::endl;

  // vectorized sort
  thrust::device_vector<int> d_result2 = d_data;
  thrust::host_vector<int> h_segments(DSIZE*NSORTS);
  thrust::generate(h_segments.begin(), h_segments.end(), my_mod);
  thrust::device_vector<int> d_segments = h_segments;
  mytime = dtime_usec(0);
  thrust::stable_sort_by_key(d_result2.begin(), d_result2.end(), d_segments.begin());
  thrust::stable_sort_by_key(d_segments.begin(), d_segments.end(), d_result2.begin());
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "vectorized time: " << mytime/(float)USECPSEC << "s" << std::endl;
  if (!validate(d_result1, d_result2)) std::cout << "mismatch 1!" << std::endl;

  // nested sort
  thrust::device_vector<int> d_result3 = d_data;
  sort_functor f = {d_result3.data(), DSIZE};
  thrust::device_vector<int> idxs(NSORTS);
  thrust::sequence(idxs.begin(), idxs.end());
  mytime = dtime_usec(0);
  thrust::for_each(idxs.begin(), idxs.end(), f);
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "nested time: " << mytime/(float)USECPSEC << "s" << std::endl;
  if (!validate(d_result1, d_result3)) std::cout << "mismatch 2!" << std::endl;
  return 0;
}
$ nvcc -arch=sm_20 -std=c++11 -o t617 t617.cu
$ ./t617
loop time: 8.51577s
vectorized time: 0.068802s
nested time: 0.567959s
$
```
Notes:
- These results will vary significantly from GPU to GPU.
- The "nested" time/method may vary significantly on a GPU that supports dynamic parallelism, as this affects how thrust dispatches the nested sort operations. To test with dynamic parallelism, change the compile options from -arch=sm_20 to -arch=sm_35 -rdc=true -lcudadevrt
- This code requires CUDA 7 RC. I used Fedora 20.
- The nested sort method will also allocate from the device side, so we must substantially increase the device allocation heap using cudaDeviceSetLimit .
- If you are using dynamic parallelism, and depending on the GPU you are running on, the amount of memory reserved with cudaDeviceSetLimit may need to be increased, perhaps by an additional factor of 8.