The problem is that you are using vectorize on a function that takes non-scalar arguments. The idea with NumbaPro's vectorize is that you give it a scalar function as input, and it generates a function that applies the scalar operation in parallel to all the elements of your arrays. See the NumbaPro documentation.
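For illustration, here is a minimal sketch of the kind of function vectorize is designed for. The function scaled_sum and its signature are made up for this example, and I'm assuming the 'gpu' target name NumbaPro used at the time:

    from numbapro import vectorize

    # A true scalar function: each argument is a single float,
    # and it returns a single float.
    @vectorize(['float32(float32, float32)'], target='gpu')
    def scaled_sum(a, b):
        return a + 2.0 * b

Calling scaled_sum(A, B) on two arrays of the same shape then applies the scalar operation to every pair of elements on the GPU.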
Your function takes a matrix and a vector, which are definitely not scalar. [Edit] You can do what you want on the GPU, either using the NumbaPro wrapper for cuBLAS, or by writing your own simple kernel function. Here is an example that demonstrates both. Note that you will need NumbaPro 0.12.2 or later (just released as of this edit).
from numbapro import jit, cuda
from numba import float32
import numbapro.cudalib.cublas as cublas
import numpy as np
from timeit import default_timer as timer

def generate_input(n):
    A = np.array(np.random.sample((n, n)), dtype=np.float32)
    B = np.array(np.random.sample(n), dtype=A.dtype)
    return A, B

@cuda.jit(argtypes=[float32[:,:], float32[:,:], float32[:]])
def diagproduct(c, a, b):
    startX, startY = cuda.grid(2)
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y
    height, width = c.shape

    # Grid-stride loops: a fixed-size grid covers a matrix of any size.
    for y in range(startY, height, gridY):
        for x in range(startX, width, gridX):
            c[y, x] = a[y, x] * b[x]

def main():
    N = 1000

    A, B = generate_input(N)
    D = np.empty(A.shape, dtype=A.dtype)
    E = np.zeros(A.shape, dtype=A.dtype)
    F = np.empty(A.shape, dtype=A.dtype)

    # Reference result on the CPU.
    start = timer()
    E = np.dot(A, np.diag(B))
    numpy_time = timer() - start

    # Approach 1: cuBLAS SGEMM via the NumbaPro wrapper.
    blas = cublas.api.Blas()

    start = timer()
    blas.gemm('N', 'N', N, N, N, 1.0, np.diag(B), A, 0.0, D)
    cublas_time = timer() - start

    diff = np.abs(D - E)
    print("Maximum CUBLAS error %f" % np.max(diff))

    # Approach 2: the custom kernel.
    blockdim = (32, 8)
    griddim = (16, 16)

    start = timer()
    dA = cuda.to_device(A)
    dB = cuda.to_device(B)
    dF = cuda.to_device(F, copy=False)
    diagproduct[griddim, blockdim](dF, dA, dB)
    dF.to_host()
    cuda_time = timer() - start

    diff = np.abs(F - E)
    print("Maximum CUDA error %f" % np.max(diff))

    print("Numpy took    %f seconds" % numpy_time)
    print("CUBLAS took   %f seconds, %0.2fx speedup" % (cublas_time, numpy_time / cublas_time))
    print("CUDA JIT took %f seconds, %0.2fx speedup" % (cuda_time, numpy_time / cuda_time))

if __name__ == '__main__':
    main()
Note that the CUDA JIT kernel is much faster because cuBLAS SGEMM performs a full matrix-matrix multiply (O(n^3)) and requires expanding the diagonal out to a full matrix first. The diagproduct kernel is smarter: it does just one multiplication per matrix element and never expands the diagonal to a full matrix. Here are the results on my NVIDIA Tesla K20c GPU for N = 1000:
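As an aside, the same one-multiply-per-element formulation exists on the CPU via NumPy broadcasting, which makes a fairer CPU baseline than np.dot(A, np.diag(B)). A small self-contained check, independent of the code above:

    import numpy as np

    n = 1000
    A = np.random.sample((n, n)).astype(np.float32)
    B = np.random.sample(n).astype(np.float32)

    # result[y, x] = A[y, x] * B[x]: one multiply per element, O(n^2) work,
    # with no expanded diagonal and no O(n^3) matrix product.
    assert np.allclose(np.dot(A, np.diag(B)), A * B)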
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.024535 seconds
CUBLAS took 0.010345 seconds, 2.37x speedup
CUDA JIT took 0.004857 seconds, 5.05x speedup
The timing includes all of the copies to and from the GPU, which is a significant bottleneck for small matrices (see the kernel-only timing sketch at the end of this answer). If we set N to 10000 and run it again, we get a much bigger speedup:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 7.245677 seconds
CUBLAS took 1.371524 seconds, 5.28x speedup
CUDA JIT took 0.264598 seconds, 27.38x speedup
However, for very small matrices, CUBLAS SGEMM has an optimized path, so it is closer to the CUDA kernel's performance. Here N = 100:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.006876 seconds
CUBLAS took 0.001425 seconds, 4.83x speedup
CUDA JIT took 0.001313 seconds, 5.24x speedup
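One last note: since the timings above include the host/device copies, you can isolate the kernel itself by timing only the launch and synchronizing afterwards. A rough sketch, reusing the device arrays and launch configuration from main() above, and assuming cuda.synchronize() is available in your NumbaPro version (it is in later Numba releases):

    start = timer()
    diagproduct[griddim, blockdim](dF, dA, dB)
    cuda.synchronize()  # kernel launches return immediately; wait for completion
    kernel_time = timer() - start

If your data can stay resident on the GPU across many calls, this kernel-only number is the more relevant one.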