I have this code (modified according to an answer). Compiling with --ptxas-options=-v reports:

    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info: Used 46 registers, 120 bytes cmem[0], 176 bytes cmem[2], 76 bytes cmem[16]
I don't know what else I need to take into account to make it work for different combinations of numPointsRs and numPointsRp points.
When, for example, I run the code with Rs = 10000 and Rp = 100000, with block = (128,1,1) and grid = (200,1), it works fine.
My calculations:

46 registers * 128 threads = 5888 registers per block.
My card has a limit of 32768 registers per SM, so 32768 / 5888 = 5 and a bit => 5 blocks / SM
(my card has a limit of 6).
With the occupancy calculator I found that using 128 threads/block gives me 42% occupancy, so I am within my card's limits.
In addition, the number of threads per MP is 640 (the limit is 1536).
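To double-check, here is the same arithmetic as a small script (a minimal sketch; the per-SM limits of 32768 registers and 1536 threads are the figures quoted above for my compute capability 2.1 card, not queried from the device):

    # Occupancy arithmetic from the figures above (limits assumed, not queried).
    REGS_PER_SM = 32768          # registers available per SM
    MAX_THREADS_PER_SM = 1536    # resident-thread limit per SM

    regs_per_thread = 46
    threads_per_block = 128

    regs_per_block = regs_per_thread * threads_per_block     # 5888
    blocks_per_sm = REGS_PER_SM // regs_per_block             # 5 (register-limited)
    threads_per_sm = blocks_per_sm * threads_per_block        # 640
    occupancy = threads_per_sm / float(MAX_THREADS_PER_SM)    # ~0.42 -> 42%

    print(regs_per_block, blocks_per_sm, threads_per_sm, occupancy)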
Now, if I try to use Rs = 100000 and Rp = 100000 (with the same block and grid), I get the errors in the title:

    cuEventDestroy failed: launch timed out
    cuModuleUnload failed: launch timed out
1) I do not know/understand what else needs to be calculated.
2) I cannot understand how we choose/find the number of blocks. I can see that, basically, people use (points + threads - 1) / threads, but that still doesn't work (see the sketch just below).
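For reference, this is how I understand that formula (a sketch only, assuming a 1-D launch with 128 threads per block and the point counts above):

    # Usual grid-size formula: ceiling division of points by threads per block.
    threads = 128
    numPointsRp = 100000

    blocks = (numPointsRp + threads - 1) // threads   # 782 blocks -> one thread per point
    print(blocks)

    # Alternatively, with a grid-stride loop in the kernel (as in the code below),
    # a smaller fixed grid such as (200, 1) still covers every point, because each
    # thread advances k by blockDim.x * gridDim.x until k >= numPointsRp.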
-------------- UPDATED ---------------------------------------------

After using driver.Context.synchronize(), the code works for many more points (1,000,000)!

But what effect does adding this call have on the code? (In many cases the screen freezes for 1 minute or more.) Should I use it or not?
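Here is a toy, self-contained sketch (not my actual kernel) of where the call sits and what it does: kernel launches are asynchronous, and synchronize() makes the host wait until the GPU has finished.

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void scale(float *x, int n)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) x[i] *= 2.0f;
    }
    """)
    scale = mod.get_function("scale")

    x = np.ones(1024, dtype=np.float32)
    x_gpu = drv.mem_alloc(x.nbytes)
    drv.memcpy_htod(x_gpu, x)

    scale(x_gpu, np.int32(x.size), block=(128, 1, 1), grid=(8, 1))
    drv.Context.synchronize()     # host blocks here until the kernel has finished
    drv.memcpy_dtoh(x, x_gpu)     # safe to copy back: the kernel is done
    print(x[:4])                  # -> [2. 2. 2. 2.]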
-------------- UPDATED 2 -------------------------------------------

Now the code is not working again; it does nothing!
Code Snapshot:
    import pycuda.gpuarray as gpuarray
    import pycuda.autoinit
    from pycuda.compiler import SourceModule
    import numpy as np
    import cmath
    import pycuda.driver as drv
    import pycuda.tools as t

    # ---- Initialization and passing (allocate memory and transfer data) to GPU ----
    # Rs, Rp, numPointsRs, numPointsRp, kp and eta are defined earlier (not shown).
    Rs_gpu = gpuarray.to_gpu(Rs)
    Rp_gpu = gpuarray.to_gpu(Rp)

    J_gpu = gpuarray.to_gpu(np.ones((numPointsRs, 3)).astype(np.complex64))
    M_gpu = gpuarray.to_gpu(np.ones((numPointsRs, 3)).astype(np.complex64))

    Evec_gpu = gpuarray.to_gpu(np.zeros((numPointsRp, 3)).astype(np.complex64))
    Hvec_gpu = gpuarray.to_gpu(np.zeros((numPointsRp, 3)).astype(np.complex64))
    All_gpu = gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))

    # -------------------------------------------------------------------------------
    mod = SourceModule("""
    #include <pycuda-complex.hpp>
    #include <cmath>
    #include <vector>

    typedef pycuda::complex<float> cmplx;
    typedef float fp3[3];
    typedef cmplx cp3[3];

    __device__ __constant__ float Pi;

    extern "C"{

    // For one observation point Rp, accumulate the contributions of all
    // numPointsRs source points into Evec, Hvec and All.
    __device__ void computeEvec(fp3 Rs_mat[], int numPointsRs,
                                cp3 J[], cp3 M[],
                                fp3 Rp,
                                cmplx kp, cmplx eta,
                                cmplx *Evec, cmplx *Hvec, cmplx *All)
    {
        int c = 0;   // loop counter (its declaration is missing from this snapshot)
        while (c < numPointsRs){
            ...                     // loop body omitted here
            c++;
        }
    }

    __global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
                                    float *Rp_mat_, int numPointsRp,
                                    cmplx *J_, cmplx *M_,
                                    cmplx kp, cmplx eta,
                                    cmplx E[][3], cmplx H[][3], cmplx *All)
    {
        fp3 *Rs_mat = (fp3 *)Rs_mat_;
        fp3 *Rp_mat = (fp3 *)Rp_mat_;
        cp3 *J = (cp3 *)J_;
        cp3 *M = (cp3 *)M_;

        // grid-stride loop over the observation points
        int k = threadIdx.x + blockIdx.x * blockDim.x;
        while (k < numPointsRp)
        {
            computeEvec(Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta, E[k], H[k], All);
            k += blockDim.x * gridDim.x;
        }
    }

    }
    """, no_extern_c=1, options=['--ptxas-options=-v'])

    # call the function (kernel)
    func = mod.get_function("computeEHfields")
    func(Rs_gpu, np.int32(numPointsRs), Rp_gpu, np.int32(numPointsRp),
         J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),
         Evec_gpu, Hvec_gpu, All_gpu,
         block=(128, 1, 1), grid=(200, 1))

    # ----- get data back from GPU -----
    Rs = Rs_gpu.get()
    Rp = Rp_gpu.get()
    J = J_gpu.get()
    M = M_gpu.get()
    Evec = Evec_gpu.get()
    Hvec = Hvec_gpu.get()
    All = All_gpu.get()
My card:
    Device 0: "GeForce GTX 560"
      CUDA Driver Version / Runtime Version          4.20 / 4.10
      CUDA Capability Major/Minor version number:    2.1
      Total amount of global memory:                 1024 MBytes (1073283072 bytes)
      ( 7) Multiprocessors x (48) CUDA Cores/MP:     336 CUDA Cores