Why is cuSPARSE so much slower than cuBLAS for sparse matrix multiplication?

Recently, when I used cuSPARSE and cuBLAS from CUDA Toolkit 6.5 to do sparse matrix multiplication, I found that cuSPARSE is much slower than cuBLAS in all cases!

In all my experiments, I used cusparseScsrmm from cuSPARSE and cublasSgemm from cuBLAS. In the sparse matrix, half of the elements are zero. The GPU I used is an NVIDIA Titan Black. All timings were obtained with the nvvp profiling tool provided by NVIDIA (a sketch of the two calls is shown after the experiment listings). Below are some results:

Experiment A:

  • Sparse matrix size: 192 x 2400
  • Dense matrix size: 2400 x 256
  • cuSPARSE time: 1.4 ms
  • cuBLAS time: 0.21 ms

Experiment B:

  • Sparse matrix size: 192 x 75
  • Dense matrix size: 75 x 1024
  • cuSPARSE time: 0.27 ms
  • cuBLAS time: 0.04 ms

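Roughly, the two calls I am timing look like the sketch below. This is only a simplified sketch: handle creation, device allocation, the CSR conversion, and the timing code are omitted, and the argument names (dA, dB, dC, the CSR arrays, m, n, k, nnz) are placeholders rather than my exact code.

```c
#include <cublas_v2.h>
#include <cusparse_v2.h>

/* Sketch only: device buffers are assumed to be allocated and filled
 * elsewhere. Both paths compute C(m x n) = A(m x k) * B(k x n); the second
 * call simply overwrites dC, which is fine when only timing is compared. */
void compare_paths(cublasHandle_t hBlas, cusparseHandle_t hSparse,
                   cusparseMatDescr_t descrA,
                   int m, int n, int k, int nnz,
                   const float *dA,                      /* dense A, m x k   */
                   const float *dCsrVal,
                   const int *dCsrRowPtr,
                   const int *dCsrColInd,                /* A in CSR format  */
                   const float *dB,                      /* dense B, k x n   */
                   float *dC)                            /* dense C, m x n   */
{
    const float alpha = 1.0f, beta = 0.0f;

    /* Dense path: cuBLAS SGEMM (column-major, no transposition). */
    cublasSgemm(hBlas, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);

    /* Sparse path: legacy cuSPARSE csrmm (CSR * dense) from the 6.5 toolkit. */
    cusparseScsrmm(hSparse, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, k, nnz, &alpha, descrA,
                   dCsrVal, dCsrRowPtr, dCsrColInd,
                   dB, k, &beta, dC, m);
}
```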
So it is very strange to see the results listed above. Since cuSPARSE is specifically designed to handle sparse matrix operations, how can it be even slower than cuBLAS?! If that is the case, there is no need to use cuSPARSE at all. Could you give me any explanation for these results? And could you suggest any other way to speed up this matrix-matrix multiplication?

1 answer

I don't think you can classify a half-zero matrix as "sparse": the timings you measured are reasonable (actually, the sparse algorithm is behaving quite well!).

Sparse algorithms are effective only for matrices in which most of the elements are zero (for example, matrices arising from finite element problems).

This holds not only for GPUs but for CPUs as well: there is a significant overhead in treating a matrix as sparse, and it only becomes worthwhile to use sparse algorithms when... most of the elements are zeros (typically ten or fewer non-zeros per row, for matrices of rank in the thousands to hundreds of thousands (millions?)).
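To make that concrete, here is a rough back-of-the-envelope for the 192 x 2400 matrix from Experiment A, assuming 4-byte values and 4-byte indices: at 50% density the CSR representation is not even smaller than the dense one, and you still pay for the indirection.

```c
#include <stdio.h>

/* Rough storage comparison for the 192 x 2400 matrix from Experiment A,
 * assuming 50% density and 4-byte floats / 4-byte ints (illustrative only). */
int main(void)
{
    const long m = 192, n = 2400;
    const long nnz = m * n / 2;            /* 50% of the entries are non-zero */

    const long dense_bytes = m * n * 4;    /* one float per entry             */
    const long csr_bytes   = nnz * 4       /* values                          */
                           + nnz * 4       /* column indices                  */
                           + (m + 1) * 4;  /* row pointers                    */

    printf("dense: %ld bytes, CSR: %ld bytes\n", dense_bytes, csr_bytes);
    /* dense: 1843200 bytes, CSR: 1843972 bytes -> at this density CSR is
     * actually slightly larger, and every access is indirect on top of that. */
    return 0;
}
```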

There are other matrix shapes that have efficient solution algorithms, which you can try if they apply to your problem, for example... I don't know whether they have been ported to cuBLAS.

About the overheads

Dense linear algebra algorithms can perform optimally because CPUs were designed to solve such systems as efficiently as possible. Consider the DGEMM operation (matrix-matrix multiplication): it lets you run a CPU at more than 95% of its theoretical peak floating-point performance for large matrices (i.e., matrices that do not fit in any system cache). How?

  • prefetching
  • optimal cache usage
  • vectorization (SSE, AVX)
  • pipelining

In a sparse LA algorithm, only the non-zero elements and their corresponding indices are stored in memory, so the memory accesses are in fact indirect. As a result, a sparse algorithm cannot exploit the hardware at the same level of optimization: I don't know of specific figures in this context, but 10 to 20% would not be unusual.
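A minimal sketch of where that indirection shows up, assuming CSR storage (the helper functions below are illustrative, not library code):

```c
/* Dense row-times-vector: contiguous accesses, prefetch- and
 * vectorization-friendly. */
float dense_row_dot(const float *A_row, const float *x, int n)
{
    float sum = 0.0f;
    for (int j = 0; j < n; ++j)
        sum += A_row[j] * x[j];
    return sum;
}

/* CSR row-times-vector: every access to x goes through the column-index
 * array, so the hardware cannot predict the access pattern or vectorize
 * it as easily. */
float csr_row_dot(const float *val, const int *colInd,
                  int rowStart, int rowEnd, const float *x)
{
    float sum = 0.0f;
    for (int jj = rowStart; jj < rowEnd; ++jj)
        sum += val[jj] * x[colInd[jj]];   /* indirect load of x */
    return sum;
}
```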

Obviously, the gain is that operations on zeros (on the non-stored elements) are simply not performed, resulting in far fewer operations and much less storage being needed.

There is further overhead from the integer logic and branching, but modern CPUs are quite good at overlapping integer and FP operations and at "speculative execution". Unfortunately, the indirect accesses can also prevent vectorization, so they add further overhead relative to the dense case.

What about GPUs?

Dense LA algorithms are optimal for GPUs just as they are for CPUs: in this case you make optimal use of the following (a kernel sketch follows the list):

  • coalescing
  • shared memory
  • memory access patterns
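As a rough illustration (a textbook-style tiled kernel, not the actual cuBLAS implementation), a dense GEMM kernel can arrange every global load to be coalesced and reuse each tile from shared memory; a CSR-based kernel cannot lay out its accesses this regularly, because the column indices are only known at run time.

```cuda
// Minimal tiled SGEMM sketch for row-major A (m x k), B (k x n), C (m x n).
// Launch with dim3 block(TILE, TILE), grid((n+TILE-1)/TILE, (m+TILE-1)/TILE).
#define TILE 16

__global__ void sgemm_tiled(const float *A, const float *B, float *C,
                            int m, int n, int k)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (k + TILE - 1) / TILE; ++t) {
        // Coalesced loads: consecutive threads read consecutive addresses.
        As[threadIdx.y][threadIdx.x] =
            (row < m && t * TILE + threadIdx.x < k)
                ? A[row * k + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t * TILE + threadIdx.y < k && col < n)
                ? B[(t * TILE + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();

        // Each loaded tile is reused TILE times from fast shared memory.
        for (int i = 0; i < TILE; ++i)
            sum += As[threadIdx.y][i] * Bs[i][threadIdx.x];
        __syncthreads();
    }

    if (row < m && col < n)
        C[row * n + col] = sum;
}
```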

Again, the indirect access to matrix elements in a sparse LA algorithm prevents the same level of optimization from being applied.

References

I can't remember which one I used when dealing with sparse problems... I think it was PSBLAS: http://people.uniroma2.it/salvatore.filippone/psblas/

But here you will be overwhelmed by the choices: http://www.netlib.org/utk/people/JackDongarra/la-sw.html
