The problem is that you are using vectorize on a function that takes non-scalar arguments. The idea with NumbaPro's vectorize is that you give it a scalar function as input, and it generates a function that applies the scalar operation in parallel to all the elements of your arrays. See the NumbaPro documentation.
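For illustration, here is a minimal sketch of the kind of function vectorize is designed for. The function scaled_sum and its signature are made up for this example, and I'm assuming the 'gpu' target name NumbaPro used at the time:

    from numbapro import vectorize

    # A true scalar function: each argument is a single float,
    # and it returns a single float.
    @vectorize(['float32(float32, float32)'], target='gpu')
    def scaled_sum(a, b):
        return a + 2.0 * b

Calling scaled_sum(A, B) on two arrays of the same shape then applies the scalar operation to every pair of elements on the GPU.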
Your function takes a matrix and a vector, which are definitely not scalar. [Edit] You can do what you want on the GPU, either using the NumbaPro wrapper for cuBLAS, or by writing your own simple kernel function. Here is an example that demonstrates both. Note that you will need NumbaPro 0.12.2 or later (just released as of this edit).
from numbapro import jit, cuda
from numba import float32
import numbapro.cudalib.cublas as cublas
import numpy as np
from timeit import default_timer as timer

def generate_input(n):
    A = np.array(np.random.sample((n, n)), dtype=np.float32)
    B = np.array(np.random.sample(n), dtype=A.dtype)
    return A, B

@cuda.jit(argtypes=[float32[:,:], float32[:,:], float32[:]])
def diagproduct(c, a, b):
    startX, startY = cuda.grid(2)
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y
    height, width = c.shape

    # Grid-stride loops: a fixed-size grid covers a matrix of any size.
    for y in range(startY, height, gridY):
        for x in range(startX, width, gridX):
            c[y, x] = a[y, x] * b[x]

def main():
    N = 1000

    A, B = generate_input(N)
    D = np.empty(A.shape, dtype=A.dtype)
    E = np.zeros(A.shape, dtype=A.dtype)
    F = np.empty(A.shape, dtype=A.dtype)

    # Reference result on the CPU.
    start = timer()
    E = np.dot(A, np.diag(B))
    numpy_time = timer() - start

    # Approach 1: cuBLAS SGEMM via the NumbaPro wrapper.
    blas = cublas.api.Blas()

    start = timer()
    blas.gemm('N', 'N', N, N, N, 1.0, np.diag(B), A, 0.0, D)
    cublas_time = timer() - start

    diff = np.abs(D - E)
    print("Maximum CUBLAS error %f" % np.max(diff))

    # Approach 2: the custom kernel.
    blockdim = (32, 8)
    griddim = (16, 16)

    start = timer()
    dA = cuda.to_device(A)
    dB = cuda.to_device(B)
    dF = cuda.to_device(F, copy=False)
    diagproduct[griddim, blockdim](dF, dA, dB)
    dF.to_host()
    cuda_time = timer() - start

    diff = np.abs(F - E)
    print("Maximum CUDA error %f" % np.max(diff))

    print("Numpy took    %f seconds" % numpy_time)
    print("CUBLAS took   %f seconds, %0.2fx speedup" % (cublas_time, numpy_time / cublas_time))
    print("CUDA JIT took %f seconds, %0.2fx speedup" % (cuda_time, numpy_time / cuda_time))

if __name__ == '__main__':
    main()
Note that the CUDA JIT kernel is much faster because cuBLAS SGEMM performs a full matrix-matrix multiply (O(n^3)) and requires expanding the diagonal out to a full matrix first. The diagproduct kernel is smarter: it does just one multiplication per matrix element and never expands the diagonal to a full matrix. Here are the results on my NVIDIA Tesla K20c GPU for N = 1000:
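As an aside, the same one-multiply-per-element formulation exists on the CPU via NumPy broadcasting, which makes a fairer CPU baseline than np.dot(A, np.diag(B)). A small self-contained check, independent of the code above:

    import numpy as np

    n = 1000
    A = np.random.sample((n, n)).astype(np.float32)
    B = np.random.sample(n).astype(np.float32)

    # result[y, x] = A[y, x] * B[x]: one multiply per element, O(n^2) work,
    # with no expanded diagonal and no O(n^3) matrix product.
    assert np.allclose(np.dot(A, np.diag(B)), A * B)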
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.024535 seconds
CUBLAS took 0.010345 seconds, 2.37x speedup
CUDA JIT took 0.004857 seconds, 5.05x speedup
The timing includes all of the copies to and from the GPU, which is a significant bottleneck for small matrices (see the kernel-only timing sketch at the end of this answer). If we set N to 10000 and run it again, we get a much bigger speedup:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 7.245677 seconds
CUBLAS took 1.371524 seconds, 5.28x speedup
CUDA JIT took 0.264598 seconds, 27.38x speedup
However, for very small matrices, CUBLAS SGEMM has an optimized path, so it is closer to the CUDA kernel's performance. Here N = 100:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.006876 seconds
CUBLAS took 0.001425 seconds, 4.83x speedup
CUDA JIT took 0.001313 seconds, 5.24x speedup
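One last note: since the timings above include the host/device copies, you can isolate the kernel itself by timing only the launch and synchronizing afterwards. A rough sketch, reusing the device arrays and launch configuration from main() above, and assuming cuda.synchronize() is available in your NumbaPro version (it is in later Numba releases):

    start = timer()
    diagproduct[griddim, blockdim](dF, dA, dB)
    cuda.synchronize()  # kernel launches return immediately; wait for completion
    kernel_time = timer() - start

If your data can stay resident on the GPU across many calls, this kernel-only number is the more relevant one.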