Scaling matrix rows using CUDA

In some calculations on the GPU, I need to scale the rows of a matrix so that all the elements in a given row sum to 1.

    | a_1,1  a_1,2  ...  a_1,N |       | alpha_1*a_1,1  alpha_1*a_1,2  ...  alpha_1*a_1,N |
    | a_2,1  a_2,2  ...  a_2,N |  =>   | alpha_2*a_2,1  alpha_2*a_2,2  ...  alpha_2*a_2,N |
    |   .      .           .   |       |       .               .                    .     |
    | a_N,1  a_N,2  ...  a_N,N |       | alpha_N*a_N,1  alpha_N*a_N,2  ...  alpha_N*a_N,N |

Where

    alpha_i = 1.0 / (a_i,1 + a_i,2 + ... + a_i,N)

I need both the alpha vector and the scaled matrix, and I would like to do this in as few BLAS calls as possible. The code will run on NVIDIA CUDA hardware. Does anyone know of a smart way to do this?

+4
3 answers

If you use a BLAS gemv with a unit vector (a vector of all ones), the result will be a vector of the reciprocals of the scaling factors (1/alpha) you need. That is the easy part.
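For example, a minimal cuBLAS sketch (the names d_A, d_ones, and d_rowsums are illustrative): multiplying the matrix by a vector of ones yields the row sums, i.e. the 1/alpha values.

    // Compute row sums (= 1/alpha) of a column-major M x N matrix with one gemv.
    // Assumes d_ones is a device vector of N ones and d_rowsums has length M.
    const float one = 1.0f, zero = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, M, N, &one, d_A, M,
                d_ones, 1, &zero, d_rowsums, 1);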

Applying the row-wise scaling factors is a bit trickier, because standard BLAS has nothing like a Hadamard product operator you could use. Also, since you mention BLAS, I assume you are using column-major storage for your matrices, which is not so straightforward for row-wise operations. A really slow way to do it would be a BLAS scal on each row with a stride, but that would require one BLAS call per row, and the strided memory access would kill performance because of its effect on coalescing and L1 cache coherency.
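Just to make the slow option concrete, a sketch of what it would look like (d_A, handle, and h_alpha are illustrative; each call scales one row of a column-major M x N matrix with leading dimension LDA):

    // Slow approach: one cublasSscal per row, striding through column-major
    // storage with increment LDA -- one BLAS call per row, poor coalescing.
    for (int i = 0; i < M; i++) {
        const float scale = h_alpha[i];   // per-row factor on the host
        cublasSscal(handle, N, &scale, d_A + i, LDA);
    }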

My suggestion would be to use your own kernel for the second operation. It doesn't have to be complicated, perhaps just something like this:

    template<typename T>
    __global__ void rowscale(T * X, const int M, const int N, const int LDA,
                             const T * ralpha)
    {
        // --- Blocks stride over the rows; threads within a block stride over the columns
        for(int row = blockIdx.x; row < M; row += gridDim.x) {
            const T rscale = 1./ralpha[row];
            for(int col = threadIdx.x; col < N; col += blockDim.x)
                X[row + col*LDA] *= rscale;
        }
    }

It's just a bunch of blocks striding through the rows column-wise, scaling as they go. It should work for any size of column-major ordered matrix. Memory access should be coalesced, but depending on how worried you are about performance, there are a number of optimizations you could try. At least it gives a general idea of what to do.
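A hypothetical launch for a column-major M x N matrix d_X might look like this (d_ralpha holds the per-row sums, i.e. 1/alpha; the grid and block sizes are just reasonable starting points):

    // 128 blocks stride over the rows, 256 threads per block over the columns
    rowscale<float><<<128, 256>>>(d_X, M, N, LDA, d_ralpha);
    cudaDeviceSynchronize();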

+2

CUBLAS 5.0 introduces a BLAS-like routine called cublas<t>dgmm, which performs multiplication of a matrix by a diagonal matrix (represented by a vector).

There is a left option (which scales the rows) and a right option (which scales the columns).

See the CUBLAS 5.0 documentation for more details.

So, for your problem, you need to create a vector containing all the alphas on the GPU and call cublas<t>dgmm with the left option.
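For a column-major M x N matrix, that call might look like this sketch (d_A and d_alpha are illustrative device pointers; for the left-side option dgmm computes diag(x) * A, and in-place operation is allowed):

    // Scale the rows of d_A in place: d_A = diag(d_alpha) * d_A
    cublasSdgmm(handle, CUBLAS_SIDE_LEFT, M, N,
                d_A, M,        // input matrix and its leading dimension
                d_alpha, 1,    // diagonal vector and its stride
                d_A, M);       // output matrix (in place)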

+5

I would like to update the answers above with a worked example using CUDA Thrust's thrust::transform and cuBLAS's cublas<t>dgmm. I skip the calculation of the scaling factors alpha, since this has already been discussed in

Reduce matrix rows with CUDA

and

Reduce matrix columns with CUDA
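For reference, a minimal sketch of what that row reduction might look like with Thrust (it reuses the linear_index_to_row_index functor from the complete example below and assumes a row-major Nrows x Ncols matrix d_matrix):

    // Row sums via reduce_by_key: the key of element i is its row index i / Ncols
    thrust::device_vector<float> d_sums(Nrows);
    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                        linear_index_to_row_index<int>(Ncols)),
        thrust::make_transform_iterator(thrust::make_counting_iterator(Nrows * Ncols),
                                        linear_index_to_row_index<int>(Ncols)),
        d_matrix.begin(),
        thrust::make_discard_iterator(),   // the row indices themselves are not needed
        d_sums.begin());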

The following is a complete example:

    #include <cstdio>
    #include <iostream>

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/random.h>
    #include <thrust/sort.h>
    #include <thrust/unique.h>
    #include <thrust/equal.h>

    #include <cublas_v2.h>

    #include "Utilities.cuh"
    #include "TimingGPU.cuh"

    /**************************************************************/
    /* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
    /**************************************************************/
    template <typename T>
    struct linear_index_to_row_index : public thrust::unary_function<T, T> {

        T Ncols; // --- Number of columns

        __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}

        __host__ __device__ T operator()(T i) { return i / Ncols; }
    };

    /***********************/
    /* RECIPROCAL OPERATOR */
    /***********************/
    struct Inv : public thrust::unary_function<float, float> {
        __host__ __device__ float operator()(float x) { return 1.0f / x; }
    };

    /********/
    /* MAIN */
    /********/
    int main()
    {
        /**************************/
        /* SETTING UP THE PROBLEM */
        /**************************/
        const int Nrows = 10; // --- Number of rows
        const int Ncols = 3;  // --- Number of columns

        // --- Random uniform integer distribution between 0 and 100
        thrust::default_random_engine rng;
        thrust::uniform_int_distribution<int> dist1(0, 100);

        // --- Random uniform integer distribution between 1 and 4
        thrust::uniform_int_distribution<int> dist2(1, 4);

        // --- Matrix allocation and initialization
        thrust::device_vector<float> d_matrix(Nrows * Ncols);
        for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist1(rng);

        // --- Normalization vector allocation and initialization
        thrust::device_vector<float> d_normalization(Nrows);
        for (size_t i = 0; i < d_normalization.size(); i++) d_normalization[i] = (float)dist2(rng);

        printf("\n\nOriginal matrix\n");
        for (int i = 0; i < Nrows; i++) {
            std::cout << "[ ";
            for (int j = 0; j < Ncols; j++)
                std::cout << d_matrix[i * Ncols + j] << " ";
            std::cout << "]\n";
        }

        printf("\n\nNormalization vector\n");
        for (int i = 0; i < Nrows; i++) std::cout << d_normalization[i] << "\n";

        TimingGPU timerGPU;

        /*********************************/
        /* ROW NORMALIZATION WITH THRUST */
        /*********************************/
        thrust::device_vector<float> d_matrix2(d_matrix);

        timerGPU.StartCounter();
        thrust::transform(d_matrix2.begin(), d_matrix2.end(),
                          thrust::make_permutation_iterator(
                              d_normalization.begin(),
                              thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                              linear_index_to_row_index<int>(Ncols))),
                          d_matrix2.begin(),
                          thrust::divides<float>());
        std::cout << "Timing - Thrust = " << timerGPU.GetCounter() << "\n";

        printf("\n\nNormalized matrix - Thrust case\n");
        for (int i = 0; i < Nrows; i++) {
            std::cout << "[ ";
            for (int j = 0; j < Ncols; j++)
                std::cout << d_matrix2[i * Ncols + j] << " ";
            std::cout << "]\n";
        }

        /*********************************/
        /* ROW NORMALIZATION WITH CUBLAS */
        /*********************************/
        d_matrix2 = d_matrix;

        cublasHandle_t handle;
        cublasSafeCall(cublasCreate(&handle));

        timerGPU.StartCounter();
        // --- Invert the normalization vector, then scale with dgmm. The row-major
        //     Nrows x Ncols matrix is seen by cuBLAS as a column-major Ncols x Nrows
        //     matrix, so scaling its columns (CUBLAS_SIDE_RIGHT) scales the rows.
        thrust::transform(d_normalization.begin(), d_normalization.end(), d_normalization.begin(), Inv());
        cublasSafeCall(cublasSdgmm(handle, CUBLAS_SIDE_RIGHT, Ncols, Nrows,
                                   thrust::raw_pointer_cast(&d_matrix2[0]), Ncols,
                                   thrust::raw_pointer_cast(&d_normalization[0]), 1,
                                   thrust::raw_pointer_cast(&d_matrix2[0]), Ncols));
        std::cout << "Timing - cuBLAS = " << timerGPU.GetCounter() << "\n";

        printf("\n\nNormalized matrix - cuBLAS case\n");
        for (int i = 0; i < Nrows; i++) {
            std::cout << "[ ";
            for (int j = 0; j < Ncols; j++)
                std::cout << d_matrix2[i * Ncols + j] << " ";
            std::cout << "]\n";
        }

        return 0;
    }

The Utilities.cu and Utilities.cuh files are maintained separately and omitted here. The TimingGPU.cu and TimingGPU.cuh files are likewise maintained separately and omitted.

I tested the above code on a Kepler K20c, and this is the result:

                     Thrust     cuBLAS
    2500 x 1250      0.20 ms    0.25 ms
    5000 x 2500      0.77 ms    0.83 ms

In the cuBLAS timing, I exclude the cublasCreate time. Even so, the CUDA Thrust version seems more convenient.

+2

Source: https://habr.com/ru/post/1396588/

