I don’t think you can classify a matrix that is half zeros as “sparse”: the time you measured is reasonable (in fact, the sparse algorithm is behaving quite well!).
Sparse algorithms are effective only for matrices in which the vast majority of elements are zeros (for example, matrices arising from finite element problems).
This is true on CPUs and GPUs alike: there is significant overhead in treating a matrix as sparse, and sparse algorithms only become the better choice when ... most of the elements are zeros (typically ten or fewer non-zeros per row, for matrices of order in the thousands to hundreds of thousands (millions?)).
There are other matrix forms with efficient solution algorithms that you can try if they apply to your problem, for example ... I don’t know whether they have been ported to cuBlas, though.
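To get a feel for why that regime matters, here is a back-of-the-envelope storage comparison (my own illustration, with assumed sizes, not from the original discussion) between a dense matrix and the same matrix stored in the common CSR format (values + column indices + row pointers):

```
/* Illustration only: storage for a dense n x n double matrix vs. CSR
   with ~nnz_per_row non-zeros per row.  Sizes are assumptions. */
#include <stdio.h>

int main(void)
{
    long n = 100000;          /* matrix order (example size)              */
    long nnz_per_row = 10;    /* the "ten or fewer non-zeros per row" case */
    long nnz = n * nnz_per_row;

    double dense_gb = 8.0 * n * n / 1e9;                    /* 8 B/double */
    double csr_gb   = (8.0 * nnz + 4.0 * nnz + 4.0 * (n + 1)) / 1e9;

    printf("dense: %.1f GB, CSR: %.3f GB\n", dense_gb, csr_gb);
    return 0;
}
```

For a matrix that is only half zeros, the picture is completely different: the index arrays eat up most of what you save, and the dense kernel wins on raw speed.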
About the overhead
Dense linear algebra algorithms can run at near-optimal speed because processors were designed to solve exactly such problems as efficiently as possible. Consider the DGEMM operation (matrix-matrix multiplication): it can drive a processor at >95% of its theoretical peak floating-point performance for large matrices (i.e., matrices that do not fit in the cache). How? Through a combination of the following (sketched in code after the list):
- prefetching
- optimal cache utilization
- vectorization (SSE, AVX)
- pipelining
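As a rough illustration of the first three points, here is only the blocking idea; real DGEMM implementations (cuBLAS, MKL, OpenBLAS, ...) add packing, register blocking, prefetch hints and hand-tuned SIMD on top of it:

```
/* Sketch of cache blocking for C += A*B on n x n row-major matrices.
   Illustration only, not a production GEMM. */
#define BLOCK 64   /* tile size chosen so three tiles fit in cache */

void gemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BLOCK)
      for (int kk = 0; kk < n; kk += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
          /* work on one tile: the data is reused many times while it
             stays resident in cache */
          for (int i = ii; i < ii + BLOCK && i < n; ++i)
            for (int k = kk; k < kk + BLOCK && k < n; ++k) {
              double a = A[(long)i * n + k];
              /* unit-stride inner loop: prefetch-friendly and easy for
                 the compiler to vectorize with SSE/AVX */
              for (int j = jj; j < jj + BLOCK && j < n; ++j)
                C[(long)i * n + j] += a * B[(long)k * n + j];
            }
}
```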
In a sparse LA algorithm only the non-zero elements and their corresponding indices are stored in memory, so the memory accesses are indirect. As a result, a sparse algorithm cannot exploit the hardware at the same level of optimization: I don’t know of precise figures for this context, but 10 to 20% of peak would not be unusual.
Obviously, the gain is that operations on zeros (on the elements that are not stored) are simply not performed, which results in far fewer operations and far less required storage.
There is additional overhead from the integer logic and branching, but modern processors are quite good at overlapping integer and FP operations and at “speculative execution”. Unfortunately, the indirect accesses can also prevent vectorization, which is yet more overhead compared to the dense case.
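For comparison, a minimal sparse matrix-vector product in the standard CSR convention (a sketch, not taken from any particular library) makes the indirect accesses and the per-element integer bookkeeping explicit:

```
/* y = A*x with A in CSR format: row_ptr has n+1 entries, col/val have
   one entry per non-zero.  Sketch for illustration. */
void spmv_csr(int n, const int *row_ptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col[j]];   /* indirect load, hard to vectorize */
        y[i] = sum;
    }
}
```

Every flop drags along an index load and a data-dependent address, which is exactly why this kernel sits so far below the peak that GEMM reaches.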
What about GPUs?
Dense LA algorithms are optimal for GPUs just as they are for CPUs: in this case you get optimal use of:
- coalescing
- shared memory
- memory access patterns
Again, the indirect access to the matrix elements in a sparse LA algorithm prevents the same level of optimization.
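To make the coalescing point concrete, here is an illustrative CUDA sketch (my own, not code from cuBLAS/cuSPARSE): a dense matrix-vector product on a column-major matrix, where consecutive threads read consecutive addresses, next to a naive one-thread-per-row CSR kernel, where the reads of x scatter across memory:

```
/* Dense y = A*x, A stored column-major (as cuBLAS expects), one thread
   per row: at each step k the threads of a warp read A[k*n + i] for
   consecutive i, i.e. consecutive addresses -> coalesced transactions. */
__global__ void dense_matvec_colmajor(int n, const double *A,
                                      const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double sum = 0.0;
    for (int k = 0; k < n; ++k)
        sum += A[(size_t)k * n + i] * x[k];   /* coalesced across the warp */
    y[i] = sum;
}

/* Naive one-thread-per-row CSR kernel: column indices are data-dependent,
   so the loads of x[col[j]] scatter and coalescing is largely lost. */
__global__ void spmv_csr_scalar(int n, const int *row_ptr, const int *col,
                                const double *val, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double sum = 0.0;
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
        sum += val[j] * x[col[j]];            /* uncoalesced, indirect */
    y[i] = sum;
}
```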
References
I can’t remember which library I used when I was working on sparse problems ... I think it was PSBLAS: http://people.uniroma2.it/salvatore.filippone/psblas/
But here you will be overwhelmed by the choice: http://www.netlib.org/utk/people/JackDongarra/la-sw.html