Thrust::device_vector in read-only memory

I have a floating point array that needs to be referenced many times on the device, so I believe the best place to store it is in __constant__ memory. The array (or vector) will be written once at runtime during initialization and then read by several different functions, so copying it to the kernel on every function call seems like a Bad Idea.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

const int n = 32;
__constant__ float dev_x[n]; // the array in question

struct struct_max : public thrust::unary_function<float,float> {
    float C;
    struct_max(float _C) : C(_C) {}
    __host__ __device__ float operator()(const float& x) const { return fmax(x,C); }
};

void foo(const thrust::host_vector<float> &, const float &);

int main()
{
    thrust::host_vector<float> x(n);
    // magic happens: populate x
    cudaMemcpyToSymbol(dev_x, x.data(), n*sizeof(float));
    foo(x, 0.0);
    return 0;
}

void foo(const thrust::host_vector<float> &input_host_x, const float &x0)
{
    thrust::device_vector<float> dev_sol(n);
    thrust::host_vector<float> host_sol(n);

    // this method works fine, but the memory transfer is unacceptable
    thrust::device_vector<float> input_dev_vec(n);
    input_dev_vec = input_host_x; // I want to avoid this
    thrust::transform(input_dev_vec.begin(), input_dev_vec.end(), dev_sol.begin(), struct_max(x0));
    host_sol = dev_sol; // this memory transfer is for debugging

    // this method compiles fine, but crashes at runtime
    thrust::device_ptr<float> dev_ptr = thrust::device_pointer_cast(dev_x);
    thrust::transform(dev_ptr, dev_ptr+n, dev_sol.begin(), struct_max(x0));
    host_sol = dev_sol; // this line crashes
}

I tried declaring a global thrust::device_vector<float> dev_x(n), but that also crashed at runtime, and it would live in __global__ memory rather than __constant__ memory anyway.

I can get this to work if I drop the Thrust library altogether, but is there a way to use Thrust together with global and constant device memory?

1 answer

Good question! You cannot use a __constant__ array as if it were a regular device pointer.

I will answer your question (below the line), but first: this is a poor use of __constant__ memory, and it is probably not what you really want. The CUDA constant cache is optimized for uniform access across the threads of a warp, meaning all threads of the warp access the same memory location at the same time. If each thread of the warp accesses a different constant memory location, the accesses are serialized. So your access pattern, where consecutive threads access consecutive memory locations, will be 32 times slower than a uniform access. You should really just use ordinary device memory. If you need to write the data once but read it many times, simply use a device_vector: initialize it once and then read from it many times.
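For example, here is a minimal sketch of that approach (my illustration, not code from the question): build the device_vector once at initialization and hand every reader a reference to it, so the only host-to-device transfer happens up front.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

const int n = 32;

struct struct_max : public thrust::unary_function<float,float> {
    float C;
    struct_max(float _C) : C(_C) {}
    __host__ __device__ float operator()(const float& x) const { return fmax(x, C); }
};

// reads x_dev straight from device memory; no per-call host<->device copy
void foo(const thrust::device_vector<float> &x_dev, const float &x0)
{
    thrust::device_vector<float> dev_sol(x_dev.size());
    thrust::transform(x_dev.begin(), x_dev.end(), dev_sol.begin(), struct_max(x0));
}

int main()
{
    thrust::host_vector<float> x(n);
    // ... populate x ...
    thrust::device_vector<float> x_dev = x;  // one-time transfer at initialization
    foo(x_dev, 0.0f);                        // every later call reuses the same device data
    foo(x_dev, 1.0f);
    return 0;
}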


To do what you asked, you can use a thrust::counting_iterator as the input to thrust::transform to generate a range of indices into your __constant__ array. Your functor's operator() then takes an int index operand rather than a float value operand, and does the lookup into constant memory.

(Note that this means your functor is now __device__-only code. You could easily overload the operator to take a float and call it differently on host data if you need portability; see the sketch below.)
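One possible shape of that overloaded functor (my sketch, assuming the dev_x __constant__ array declared earlier):

struct struct_max : public thrust::unary_function<int,float> {
    float C;
    struct_max(float _C) : C(_C) {}

    // device-only overload: look the value up in the __constant__ array by index
    __device__ float operator()(const int& i) const { return fmax(dev_x[i], C); }

    // host/device overload: operate on a float value directly, for portability
    __host__ __device__ float operator()(const float& x) const { return fmax(x, C); }
};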

I modified your example to initialize the data and print the result so you can verify that it is correct.

#include <stdio.h>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>

const int n = 32;
__constant__ float dev_x[n]; // the array in question

struct struct_max : public thrust::unary_function<int,float> {
    float C;
    struct_max(float _C) : C(_C) {}

    // only works as a device function
    __device__ float operator()(const int& i) const {
        // use index into constant array
        return fmax(dev_x[i], C);
    }
};

void foo(const thrust::host_vector<float> &input_host_x, const float &x0)
{
    thrust::device_vector<float> dev_sol(n);
    thrust::host_vector<float> host_sol(n);

    thrust::transform(thrust::make_counting_iterator(0),
                      thrust::make_counting_iterator(n),
                      dev_sol.begin(),
                      struct_max(x0));
    host_sol = dev_sol; // copy back to host for printing

    for (int i = 0; i < n; i++)
        printf("%f\n", host_sol[i]);
}

int main()
{
    thrust::host_vector<float> x(n);
    // magic happens: populate x
    for (int i = 0; i < n; i++)
        x[i] = rand() / (float)RAND_MAX;
    cudaMemcpyToSymbol(dev_x, x.data(), n*sizeof(float));
    foo(x, 0.5);
    return 0;
}
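As an aside (my reading, not part of the original answer): the crash in the question most likely comes from applying device_pointer_cast(dev_x) in host code, where the address of the __constant__ symbol is not a valid device pointer. If you really do want a raw device pointer to the symbol, cudaGetSymbolAddress can provide one, though reads through it take the normal global load path rather than the constant-cache broadcast:

float *raw_ptr = 0;
cudaGetSymbolAddress((void **)&raw_ptr, dev_x); // valid device address of the __constant__ symbol
thrust::device_ptr<float> dev_ptr = thrust::device_pointer_cast(raw_ptr);
// dev_ptr can now be used as a Thrust iterator with a float-taking functor
// like the one in the question; the performance caveat above still applies.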

Source: https://habr.com/ru/post/947103/

