I can think of two possibilities, one of which was already suggested by @JaredHoberock. I don't know of a general methodology to fuse for-loop iterations in thrust, but the second method is a more general approach. My guess is that the first method would be the faster of the two, in this case.
Use a vectorized sort. If the regions to be sorted by your nested for-loops are non-overlapping, you can do a vectorized sort using two back-to-back stable-sort operations, as discussed.
Thrust v1.8 (available with CUDA 7 RC, or via direct download from the Thrust github repository) supports nesting thrust algorithms, by including a thrust algorithm call within a custom functor passed to another thrust algorithm. If you use thrust::for_each to select the individual sorts you need to perform, you can run those sorts with a single thrust algorithm call, by including the thrust::sort operation in the functor you pass to thrust::for_each .
Here is a fully worked comparison of the three methods:
- original sort-in-a-loop method
- vectorized / batch sort
- nested sort
In each case, we sort the same 16,000 sets of 1,000 integers.
```
$ cat t617.cu
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/generate.h>
#include <thrust/equal.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <iostream>
#include <stdlib.h>

#define NSORTS 16000
#define DSIZE 1000

int my_mod_start = 0;
int my_mod(){
  return (my_mod_start++)/DSIZE;
}

bool validate(thrust::device_vector<int> &d1, thrust::device_vector<int> &d2){
  return thrust::equal(d1.begin(), d1.end(), d2.begin());
}

struct sort_functor
{
  thrust::device_ptr<int> data;
  int dsize;
  __host__ __device__
  void operator()(int start_idx)
  {
    thrust::sort(thrust::device, data+(dsize*start_idx), data+(dsize*(start_idx+1)));
  }
};

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int main(){
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, (16*DSIZE*NSORTS));
  thrust::host_vector<int> h_data(DSIZE*NSORTS);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  thrust::device_vector<int> d_data = h_data;

  // first the loop method
  thrust::device_vector<int> d_result1 = d_data;
  thrust::device_ptr<int> r1ptr = thrust::device_pointer_cast<int>(d_result1.data());
  unsigned long long mytime = dtime_usec(0);
  for (int i = 0; i < NSORTS; i++)
    thrust::sort(r1ptr+(i*DSIZE), r1ptr+((i+1)*DSIZE));
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "loop time: " << mytime/(float)USECPSEC << "s" << std::endl;

  // vectorized sort
  thrust::device_vector<int> d_result2 = d_data;
  thrust::host_vector<int> h_segments(DSIZE*NSORTS);
  thrust::generate(h_segments.begin(), h_segments.end(), my_mod);
  thrust::device_vector<int> d_segments = h_segments;
  mytime = dtime_usec(0);
  thrust::stable_sort_by_key(d_result2.begin(), d_result2.end(), d_segments.begin());
  thrust::stable_sort_by_key(d_segments.begin(), d_segments.end(), d_result2.begin());
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "vectorized time: " << mytime/(float)USECPSEC << "s" << std::endl;
  if (!validate(d_result1, d_result2)) std::cout << "mismatch 1!" << std::endl;

  // nested sort
  thrust::device_vector<int> d_result3 = d_data;
  sort_functor f = {d_result3.data(), DSIZE};
  thrust::device_vector<int> idxs(NSORTS);
  thrust::sequence(idxs.begin(), idxs.end());
  mytime = dtime_usec(0);
  thrust::for_each(idxs.begin(), idxs.end(), f);
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "nested time: " << mytime/(float)USECPSEC << "s" << std::endl;
  if (!validate(d_result1, d_result3)) std::cout << "mismatch 2!" << std::endl;
  return 0;
}
$ nvcc -arch=sm_20 -std=c++11 -o t617 t617.cu
$ ./t617
loop time: 8.51577s
vectorized time: 0.068802s
nested time: 0.567959s
$
```
Notes:
- These results will vary significantly from GPU to GPU.
- The "nested" time/method may vary significantly on a GPU that supports dynamic parallelism, as this affects how thrust dispatches the nested sort operations. To test with dynamic parallelism, change the compile options from -arch=sm_20 to -arch=sm_35 -rdc=true -lcudadevrt
- This code requires CUDA 7 RC. I used Fedora 20.
- The nested sort method will also allocate from the device side, so we must substantially increase the device allocation heap using cudaDeviceSetLimit .
- If you are using dynamic parallelism, and depending on the GPU you are running on, the amount of memory reserved with cudaDeviceSetLimit may need to be increased, perhaps by an additional factor of 8.