Does Thrust support transforms over multiple vectors?

Is there a more efficient way to write a = a + b + c with Thrust?

    thrust::transform(b.begin(), b.end(), c.begin(), b.begin(), thrust::plus<int>());
    thrust::transform(a.begin(), a.end(), b.begin(), a.begin(), thrust::plus<int>());

This works, but is there a way to get the same effect with a single call? I looked at the saxpy implementation in the examples, but it uses two vectors and a constant value.


Would the following be more efficient?

    #include <thrust/host_vector.h>
    #include <thrust/for_each.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/tuple.h>
    #include <iostream>

    struct arbitrary_functor
    {
        template <typename Tuple>
        __host__ __device__
        void operator()(Tuple t)
        {
            // D[i] = A[i] + B[i] + C[i];
            thrust::get<3>(t) = thrust::get<0>(t) + thrust::get<1>(t) + thrust::get<2>(t);
        }
    };

    int main()
    {
        // allocate storage
        thrust::host_vector<int> A;
        thrust::host_vector<int> B;
        thrust::host_vector<int> C;

        // initialize input vectors
        A.push_back(10);
        B.push_back(10);
        C.push_back(10);

        // apply the transformation; A is both the first input and the output
        thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin(), C.begin(), A.begin())),
                         thrust::make_zip_iterator(thrust::make_tuple(A.end(), B.end(), C.end(), A.end())),
                         arbitrary_functor());

        // print the output
        std::cout << A[0] << std::endl;
        return 0;
    }
1 answer

a = a + b + c has low arithmetic intensity (only two arithmetic operations for every four memory operations), so the computation will be bound by memory bandwidth. To compare the performance of your two proposed solutions, we need to estimate how much memory traffic each one generates.

Each call to transform in the first solution performs two loads and one store per invocation of plus. We can therefore model the cost of each transform call as 3N, where N is the size of the vectors a, b, and c. Since there are two calls to transform, the cost of this solution is 6N.

We can model the cost of the second solution in the same way. Each invocation of arbitrary_functor performs three loads and one store, so the cost model for this solution is 4N. This implies that the for_each solution should be more efficient than calling transform twice. When N is large, the second solution should run about 6N/4N = 1.5x faster than the first.

Of course, you can always combine zip_iterator with transform in a similar way to avoid two separate calls to transform.
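A minimal sketch of that zip_iterator plus transform variant follows. The functor name sum3, the use of device_vector, and the sample values are assumptions for illustration and are not taken from the question. Under the same cost model it performs three loads and one store per element, so it should also cost about 4N.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <iostream>

// Hypothetical functor: returns the sum of the three zipped components.
struct sum3
{
    template <typename Tuple>
    __host__ __device__
    int operator()(const Tuple& t) const
    {
        return thrust::get<0>(t) + thrust::get<1>(t) + thrust::get<2>(t);
    }
};

int main()
{
    // Illustrative sizes and values (assumed, not from the question).
    const int n = 4;
    thrust::device_vector<int> a(n, 1);
    thrust::device_vector<int> b(n, 2);
    thrust::device_vector<int> c(n, 3);

    // Zip the three inputs so a single transform reads all of them.
    auto first = thrust::make_zip_iterator(
        thrust::make_tuple(a.begin(), b.begin(), c.begin()));
    auto last = thrust::make_zip_iterator(
        thrust::make_tuple(a.end(), b.end(), c.end()));

    // a[i] = a[i] + b[i] + c[i], written back into a in a single pass.
    thrust::transform(first, last, a.begin(), sum3());

    // Each element is now 1 + 2 + 3 = 6.
    for (int i = 0; i < n; ++i)
        std::cout << a[i] << std::endl;

    return 0;
}
```

Compared with the for_each version, the output is the return value of the functor rather than a fourth tuple slot, which keeps the inputs read-only inside the functor.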


Source: https://habr.com/ru/post/897813/

