CUDA - out of memory (threads and blocks issue) -- Address is out of bounds

I use 63 registers per thread, so (with a maximum of 32768 registers) I can use at most about 520 threads. In this example I am using 512 threads.

(The parallelism is in the "computeEvec" function, which is called inside the __global__ computeEHfields kernel.) My problems are:

1) The memory-check errors shown below.

2) When I use numPointsRp > 2000, it shows me "out of memory", but (if I'm not mistaken) I have calculated the global memory usage and it is fine.
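For reference, my back-of-the-envelope estimate of the global memory those arrays need (a rough sketch, not part of the real script, assuming numPointsRs = numPointsRp = 2000):

import numpy as np

# Rough global-memory estimate for the arrays the script transfers to the GPU.
numPointsRs = 2000
numPointsRp = 2000

f32 = np.dtype(np.float32).itemsize    # 4 bytes
c64 = np.dtype(np.complex64).itemsize  # 8 bytes

total = (numPointsRs * 3 * f32          # Rs
         + numPointsRp * 3 * f32        # Rp
         + 2 * numPointsRs * 3 * c64    # J, M
         + 2 * numPointsRp * 3 * c64    # Evec, Hvec
         + numPointsRp * c64)           # All
print(total / 1024.0 ** 2, "MB")        # about 0.24 MB, nowhere near 1 GB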

------------------------------- UPDATED -------------------------------

I ran the program under cuda-memcheck and (only when numPointsRs > numPointsRp) it gives me:

 ========= Invalid __global__ read of size 4
 =========     at 0x00000428 in computeEHfields
 =========     by thread (2,0,0) in block (0,0,0)
 =========     Address 0x4001076e0 is out of bounds
 =========
 ========= Invalid __global__ read of size 4
 =========     at 0x00000428 in computeEHfields
 =========     by thread (1,0,0) in block (0,0,0)
 =========     Address 0x4001076e0 is out of bounds
 =========
 ========= Invalid __global__ read of size 4
 =========     at 0x00000428 in computeEHfields
 =========     by thread (0,0,0) in block (0,0,0)
 =========     Address 0x4001076e0 is out of bounds
 =========
 ========= ERROR SUMMARY: 160 errors

----------- EDIT ----------------------------

Also, in several cases (when I use only threads, not blocks -- I have not tested it with blocks), if for example I have numPointsRs = 1000 and numPointsRp = 100, then change to numPointsRp = 200, and then change back to numPointsRp = 100, I do not get the first results again!

import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv

# numPointsRs, numPointsRp, kp and eta are set earlier in the script
# (e.g. numPointsRs = 1000, numPointsRp = 100)

Rs = np.zeros((numPointsRs, 3)).astype(np.float32)
for k in range(numPointsRs):
    Rs[k] = [0, k, 0]

Rp = np.zeros((numPointsRp, 3)).astype(np.float32)
for k in range(numPointsRp):
    Rp[k] = [1 + k, 0, 0]

#---- Initialization and passing (allocate memory and transfer data) to GPU ----
Rs_gpu = gpuarray.to_gpu(Rs)
Rp_gpu = gpuarray.to_gpu(Rp)

J_gpu = gpuarray.to_gpu(np.ones((numPointsRs, 3)).astype(np.complex64))
M_gpu = gpuarray.to_gpu(np.ones((numPointsRs, 3)).astype(np.complex64))

Evec_gpu = gpuarray.to_gpu(np.zeros((numPointsRp, 3)).astype(np.complex64))
Hvec_gpu = gpuarray.to_gpu(np.zeros((numPointsRp, 3)).astype(np.complex64))
All_gpu = gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))

mod = SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>
#define RowRsSize %(numrs)d
#define RowRpSize %(numrp)d

typedef pycuda::complex<float> cmplx;
extern "C"{

__device__ void computeEvec(float Rs_mat[][3], int numPointsRs,
                            cmplx J[][3], cmplx M[][3],
                            float *Rp, cmplx kp, cmplx eta,
                            cmplx *Evec, cmplx *Hvec, cmplx *All)
{
    int c = 0;   // loop counter (its declaration is missing from the snippet as posted)
    while (c < numPointsRs) {
        ...
        c++;
    }
}

__global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
                                float *Rp_mat_, int numPointsRp,
                                cmplx *J_, cmplx *M_,
                                cmplx kp, cmplx eta,
                                cmplx E[][3], cmplx H[][3], cmplx *All)
{
    float Rs_mat[RowRsSize][3];
    float Rp_mat[RowRpSize][3];
    cmplx J[RowRsSize][3];
    cmplx M[RowRsSize][3];

    int k = threadIdx.x + blockIdx.x * blockDim.x;
    while (k < numPointsRp)
    {
        computeEvec(Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta,
                    E[k], H[k], All);
        k += blockDim.x * gridDim.x;
    }
}
}
""" % {"numrs": numPointsRs, "numrp": numPointsRp}, no_extern_c=1)

func = mod.get_function("computeEHfields")

func(Rs_gpu, np.int32(numPointsRs), Rp_gpu, np.int32(numPointsRp),
     J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),
     Evec_gpu, Hvec_gpu, All_gpu,
     block=(128, 1, 1), grid=(200, 1))

print(" \n")

#----- get data back from GPU -----
Rs = Rs_gpu.get()
Rp = Rp_gpu.get()
J = J_gpu.get()
M = M_gpu.get()
Evec = Evec_gpu.get()
Hvec = Hvec_gpu.get()
All = All_gpu.get()
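One thing I also tried while debugging (a sketch, using the pycuda.driver import above as drv): an explicit synchronize right after the launch, so that any kernel error surfaces at that point instead of showing up later as stale results.

# Make sure the kernel has actually finished (and raise any pending CUDA
# error here) before reading the results back.
drv.Context.synchronize()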

-------------------- GPU model --------------------

 Device 0: "GeForce GTX 560"
   CUDA Driver Version / Runtime Version          4.20 / 4.10
   CUDA Capability Major/Minor version number:    2.1
   Total amount of global memory:                 1024 MBytes (1073283072 bytes)
   ( 0) Multiprocessors x (48) CUDA Cores/MP:     0 CUDA Cores   // really 336 CUDA cores => 7 MP and 48 cores/MP
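(The same figures can be read back through PyCUDA itself; a small sketch, assuming a pycuda.autoinit context:)

import pycuda.autoinit
import pycuda.driver as drv

dev = pycuda.autoinit.device
print(dev.name())
print(dev.total_memory() // (1024 * 1024), "MB of global memory")
print(dev.get_attribute(drv.device_attribute.MULTIPROCESSOR_COUNT), "multiprocessors")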
2 answers

When I use numPointsRp > 2000, it shows me "out of memory"

Now that we have real code to work with, let's compile it and see what happens. Using RowRsSize=2000 and RowRpSize=200, and compiling with the CUDA 4.2 release of nvcc, I get:

 nvcc -arch=sm_21 -Xcompiler="-D RowRsSize=2000 -D RowRpSize=200" -Xptxas="-v" -c -I./ kivekset.cu
 ptxas info : Compiling entry function '_Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_' for 'sm_21'
 ptxas info : Function properties for _Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_
     122432 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
 ptxas info : Used 57 registers, 84 bytes cmem[0], 168 bytes cmem[2], 76 bytes cmem[16]

The key numbers are 57 registers and a 122432 byte stack frame per thread. The occupancy calculator suggests that a block of 512 threads will have at most 1 block resident per SM, and your GPU has 7 SM. That gives 122432 * 512 * 7 = 438796288 bytes of stack frame (local memory) just to run your kernel, before you have allocated a single byte of memory for input and output with PyCUDA. On a GPU with 1 GB of memory, it is not hard to imagine running out. Your kernel has an enormous local memory footprint. Start thinking about ways to reduce it.
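Spelled out as a quick check (a sketch; the per-thread stack figure can also be read from PyCUDA via the compiled function, if your version exposes the attribute):

# Back-of-the-envelope check of the figures above. With the question's script
# loaded, `func.local_size_bytes` should report roughly the same per-thread
# local memory figure as the ptxas output.
stack_per_thread = 122432        # bytes per thread, from the ptxas output
threads_per_block = 512
resident_blocks_per_sm = 1       # occupancy limited by register usage
num_sm = 7                       # GeForce GTX 560

total_local = stack_per_thread * threads_per_block * resident_blocks_per_sm * num_sm
print(total_local, "bytes, i.e. about", total_local // 2**20, "MB")   # ~418 MB of the 1 GB card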


As I pointed out in the comments, it is completely unclear why each thread needs a full copy of the input data in this kernel code. Doing so results in a gigantic local memory footprint, and there seems to be absolutely no reason why the code should be written that way. I suspect you could change the kernel to something like this:

 typedef pycuda::complex<float> cmplx;
 typedef float fp3[3];
 typedef cmplx cp3[3];

 __global__ void computeEHfields2(float *Rs_mat_, int numPointsRs,
                                  float *Rp_mat_, int numPointsRp,
                                  cmplx *J_, cmplx *M_,
                                  cmplx kp, cmplx eta,
                                  cmplx E[][3], cmplx H[][3], cmplx *All)
 {
     // Reinterpret the flat input pointers as arrays of 3-element rows
     // instead of copying everything into per-thread local arrays.
     fp3 *Rs_mat = (fp3 *)Rs_mat_;
     cp3 *J      = (cp3 *)J_;
     cp3 *M      = (cp3 *)M_;

     int k = threadIdx.x + blockIdx.x * blockDim.x;
     while (k < numPointsRp)
     {
         fp3 *Rp_mat = (fp3 *)(Rp_mat_ + 3 * k);   // k-th point = 3 consecutive floats
         computeEvec2(Rs_mat, numPointsRs, J, M, *Rp_mat, kp, eta,
                      E[k], H[k], All);
         k += blockDim.x * gridDim.x;
     }
 }

and the __device__ function it calls would become something like this:

 __device__ void computeEvec2(fp3 Rs_mat[], int numPointsRs,
                              cp3 J[], cp3 M[],
                              fp3 Rp, cmplx kp, cmplx eta,
                              cmplx *Evec, cmplx *Hvec, cmplx *All)
 {
     ....
 }

and this eliminates every byte of per-thread local memory without changing the functionality of the computational code at all.
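On the host side nothing needs to change apart from dropping the (now unnecessary) RowRsSize/RowRpSize defines. A sketch of what the launch of the reworked kernel might look like, assuming it is compiled into the same extern "C" block of the SourceModule and the gpuarrays from your script are still in scope:

# Hypothetical launch of the reworked kernel; the argument list and launch
# configuration are unchanged from the original script.
func2 = mod.get_function("computeEHfields2")
func2(Rs_gpu, np.int32(numPointsRs), Rp_gpu, np.int32(numPointsRp),
      J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),
      Evec_gpu, Hvec_gpu, All_gpu,
      block=(128, 1, 1), grid=(200, 1))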


Using R = 1000, with block = R/2,1,1 and grid = 1,1, everything is fine.

If I try R = 10000, with block = R/20,1,1 and grid = 20,1, then it shows me "out of memory".

I am not familiar with pycuda, and I have not read very deeply into your code either. However, you have more blocks and more threads, so more of the following will be used:

  • local memory (probably the kernel's stack, which is allocated per thread),

  • shared memory (which is allocated per block), or

  • global memory, which is allocated based on the grid / gridDim.

You can reduce the size of the stack by calling

 cudaDeviceSetLimit(cudaLimitStackSize, N);

(That is the C runtime API call, but the PyCUDA equivalent should not be hard to find.)
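In PyCUDA, something along these lines should do it (a sketch; it assumes the driver-API limit bindings present in recent PyCUDA versions):

import pycuda.autoinit
import pycuda.driver as drv

# Driver-API counterpart of cudaDeviceSetLimit(cudaLimitStackSize, N):
# shrink the per-thread stack to what the kernel actually needs.
N = 4096  # bytes per thread -- pick a value that fits your kernel
drv.Context.set_limit(drv.limit.STACK_SIZE, N)
print(drv.Context.get_limit(drv.limit.STACK_SIZE))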


Source: https://habr.com/ru/post/1432276/

