CUDA: How does Thrust manage memory when using a comparator in a sort function?

I have a 10-character char array that I would like to pass as an argument to the comparator that the Thrust sort function will use.

To allocate memory for this array, I use cudaMalloc. However, cudaMalloc allocates memory in global memory, so whenever a thread wants to read data from this array, it must access global memory.

But this array is small, and I believe it would be more efficient if it were stored in shared memory, or even in the registers of each thread. Is it possible to achieve this with Thrust, and if so, how?

Here is the comparator:

    struct comp {
        int *data_to_sort;
        char *helpingArray;

        comp(int *data_ptr) { this->data_to_sort = data_ptr; }

        __host__ __device__
        bool operator()(const int &a, const int &b) {
            // use helpingArray to do some comparisons and
            // return true/false accordingly
        }
    };

Then I allocate memory for helpingArray in global memory and pass it, along with the comparator struct, as an argument to the sort function.

Please note that the data_to_sort array is stored in global memory because it contains the data that needs to be sorted; we cannot avoid this.

This works fine, and this sorting method is faster than sorting on the CPU. However, I believe that if I avoid storing helpingArray in global memory, the sort will become much faster.
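For concreteness, here is a minimal sketch of the setup described above (sizes and names are assumed, and the comparison body is a placeholder): both the data and the 10-byte helper array live in global memory, so every comparison that touches helpingArray dereferences a global-memory pointer.

```cuda
#include <thrust/sort.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>

struct comp {
    int  *data_to_sort;
    char *helpingArray;

    comp(int *data_ptr, char *helper)
        : data_to_sort(data_ptr), helpingArray(helper) {}

    __host__ __device__
    bool operator()(const int &a, const int &b) const {
        // illustrative only; the real comparator consults helpingArray
        return a < b;
    }
};

int main() {
    const int N = 1000;  // assumed size
    int  *d_data;
    char *d_helping;
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMalloc(&d_helping, 10 * sizeof(char));
    // ... cudaMemcpy the data and the 10 helper bytes to the device ...

    thrust::device_ptr<int> p(d_data);
    thrust::sort(p, p + N, comp(d_data, d_helping));

    cudaFree(d_helping);
    cudaFree(d_data);
}
```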

1 answer

I would agree that keeping helpingArray in global memory makes little sense and reduces performance, at least to some extent. The Thrust back end that launches the kernels is "closed" and does not expose kernel-level features such as shared memory or registers, so they cannot be used directly.

Having said that, there are perhaps two things you can do to improve this. The first would be to declare your functor as follows:

    struct comp {
        char helpingArray[10];

        __host__ __device__
        bool operator()(const int &a, const int &b) { ... }
    };

You can fill in helpingArray in host code before passing the functor to the Thrust algorithm call (note that the functor is passed by value, so this is completely legal). In this case, helpingArray will probably end up in thread-local memory, and may well be cached or held in registers. This may or may not be a performance gain, but it certainly simplifies the host code needed to support it.
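A minimal sketch of this by-value approach (the data values and comparison body are placeholders): the 10-byte helper array lives inside the functor itself, so no cudaMalloc or device pointer is needed for it.

```cuda
#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <cstring>

struct comp {
    char helpingArray[10];

    __host__ __device__
    bool operator()(const int &a, const int &b) const {
        // illustrative only; the real logic would consult helpingArray
        return a < b;
    }
};

int main() {
    thrust::device_vector<int> data(1000);
    // ... fill data ...

    comp c;
    std::memcpy(c.helpingArray, "0123456789", 10);  // fill on the host

    // the functor is copied by value to the device; helpingArray
    // likely ends up in thread-local storage or registers
    thrust::sort(data.begin(), data.end(), c);
}
```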

The other alternative is to declare helpingArray in __constant__ memory and simply refer to it inside the functor. If the access pattern is uniform across threads, there may be a performance benefit from the constant cache.
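A sketch of the __constant__ memory variant (again with placeholder data and comparison logic); note that the functor reads the file-scope symbol directly, and the host fills it with cudaMemcpyToSymbol:

```cuda
#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>

__constant__ char helpingArray[10];  // served through the constant cache

struct comp {
    __device__
    bool operator()(const int &a, const int &b) const {
        // illustrative only; the real logic would read helpingArray here
        return a < b;
    }
};

int main() {
    char h_helping[10] = "012345678";  // assumed contents
    cudaMemcpyToSymbol(helpingArray, h_helping, 10);

    thrust::device_vector<int> data(1000);
    // ... fill data ...
    thrust::sort(data.begin(), data.end(), comp());
}
```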


Source: https://habr.com/ru/post/1487706/
