driver.Context.synchronize() - what else needs to be taken into account - a clean-up operation failed

I have this code here (modified according to the answer).

Info

32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info: Used 46 registers, 120 bytes cmem[0], 176 bytes cmem[2], 76 bytes cmem[16]

I don't know what else needs to be taken into account in order to make it work for different combinations of numPointsRs and numPointsRp points.

For example, when I run the code with Rs = 10000 and Rp = 100000 with block = (128,1,1), grid = (200,1), it works fine.

My calculations:

46 registers * 128 threads = 5888 registers.

My card has a limit of 32768 registers, therefore 32768 / 5888 = 5 and some remainder => 5 blocks/SM
(my card has a limit of 6).

With the occupancy calculator, I found that using 128 threads/block gives me 42% occupancy, and I am within my card's limits.

In addition, the number of threads per MP is 640 (the limit is 1536)
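For reference, here is the same arithmetic spelled out as a small Python check (the figures are the ones quoted above; this naive version ignores the register-allocation granularity that the first answer below points out):

regs_per_thread    = 46
threads_per_block  = 128
regs_per_sm        = 32768     # CC 2.1 register file per SM
max_threads_per_sm = 1536      # CC 2.1 thread limit per SM

regs_per_block = regs_per_thread * threads_per_block     # 5888
blocks_by_regs = regs_per_sm // regs_per_block            # 5 blocks/SM by registers alone
threads_per_sm = blocks_by_regs * threads_per_block       # 640 threads/SM
print(blocks_by_regs, threads_per_sm, float(threads_per_sm) / max_threads_per_sm)   # 5, 640, ~0.42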

Now, if I try to use Rs = 100000 and Rp = 100000 (with the same threads and blocks), I get the messages from the title:

cuEventDestroy failed: launch timeout

cuModuleUnload failed: launch timeout

1) I do not know / understand what else needs to be calculated.

2) I cannot understand how to choose / find the number of blocks. I can see that basically people use (points + threads - 1) / threads, but that still doesn't work for me.
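For what it's worth, here is a small worked example of that formula (the sizes are hypothetical; "points" stands in for numPointsRp from the code snapshot below):

points  = 100000
threads = 128
blocks  = (points + threads - 1) // threads   # ceiling division -> 782 blocks
print(blocks)

# The kernel in the snapshot below uses a grid-stride loop
# ("while (k < numPointsRp) ... k += blockDim.x*gridDim.x"), so a smaller fixed
# grid such as grid=(200,1) also covers every point; each thread simply iterates
# more times.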

-------------- UPDATED --------------

After using driver.Context.synchronize(), the code works for many points (1,000,000)!

But what effect does adding this call have on the code? (In many cases the screen freezes for 1 minute or more.) Should I use it or not?
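A minimal, self-contained sketch of where the call is assumed to sit (trivial stand-in kernel, not the real one), just to show what it does: it makes the host wait until all queued GPU work has finished, so any launch error surfaces here instead of at a later clean-up call:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *a, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) a[i] *= 2.0f;
}
""")
scale = mod.get_function("scale")

a_gpu = gpuarray.to_gpu(np.ones(1024, dtype=np.float32))
scale(a_gpu, np.int32(a_gpu.size), block=(128, 1, 1), grid=(8, 1))
drv.Context.synchronize()   # the host blocks here until the kernel has finished
a = a_gpu.get()             # the device-to-host copy also waits on the default stream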

-------------- UPDATED 2 --------------

Now the code does not work again; it does nothing!

Code Snapshot:

import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv
import pycuda.tools as t

#---- Initialization and passing (allocate memory and transfer data) to GPU ------------------
Rs_gpu=gpuarray.to_gpu(Rs)
Rp_gpu=gpuarray.to_gpu(Rp)

J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))

Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))

#---------------------------------------------------------------------------------------------
mod =SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>

typedef pycuda::complex<float> cmplx;
typedef float fp3[3];
typedef cmplx cp3[3];

__device__ __constant__ float Pi;

extern "C"{

    __device__ void computeEvec(fp3 Rs_mat[], int numPointsRs,
                                cp3 J[],
                                cp3 M[],
                                fp3 Rp,
                                cmplx kp,
                                cmplx eta,
                                cmplx *Evec,
                                cmplx *Hvec, cmplx *All)
    {
        while (c<numPointsRs){
            ...
            c++;
        }
    }

    __global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
                                    float *Rp_mat_, int numPointsRp,
                                    cmplx *J_,
                                    cmplx *M_,
                                    cmplx kp,
                                    cmplx eta,
                                    cmplx E[][3],
                                    cmplx H[][3], cmplx *All )
    {
        fp3 * Rs_mat=(fp3 *)Rs_mat_;
        fp3 * Rp_mat=(fp3 *)Rp_mat_;
        cp3 * J=(cp3 *)J_;
        cp3 * M=(cp3 *)M_;

        int k=threadIdx.x+blockIdx.x*blockDim.x;

        while (k<numPointsRp)
        {
            computeEvec( Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
            k+=blockDim.x*gridDim.x;
        }
    }
}
""" ,no_extern_c=1,options=['--ptxas-options=-v'])

#call the function (kernel)
func = mod.get_function("computeEHfields")

func(Rs_gpu,np.int32(numPointsRs),Rp_gpu,np.int32(numPointsRp),J_gpu, M_gpu, np.complex64(kp), np.complex64(eta), Evec_gpu, Hvec_gpu, All_gpu, block=(128,1,1), grid=(200,1))

#----- get data back from GPU -----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
J=J_gpu.get()
M=M_gpu.get()
Evec=Evec_gpu.get()
Hvec=Hvec_gpu.get()
All=All_gpu.get()

My card:

Device 0: "GeForce GTX 560"
  CUDA Driver Version / Runtime Version         4.20 / 4.10
  CUDA Capability Major/Minor version number:   2.1
  Total amount of global memory:                1024 MBytes (1073283072 bytes)
  ( 0) Multiprocessors x (48) CUDA Cores/MP:    0 CUDA Cores   // 336 CUDA Cores => 7 MP and 48 Cores/MP
+4
2 answers

There are a lot of issues that you have to deal with. The answer provided by @njuffa (below) is the best overall solution. I will provide more feedback based on the limited data you have provided.

  • The PTX output of 46 registers is not the number of registers used by your kernel. PTX is an intermediate representation. The offline or JIT compiler converts this to device code, and the device code may use more or fewer registers. Nsight Visual Studio Edition, the Visual Profiler, and the CUDA command-line profiler can all show you the correct register count.

  • The occupancy calculation is not simply RegistersPerSM / RegistersPerThread. Registers are allocated based on a granularity. For CC 2.1, the granularity is 4 registers per thread per warp (128 registers). 2.x devices can actually allocate at a 2-register granularity, but this can lead to fragmentation later in the kernel.

  • In your occupancy calculation you state:

My card has a limit of 32768 registers, so 32768 / 5888 = 5 and some remainder => 5 blocks/SM (my card has a limit of 6).

I'm not sure what the 6 refers to. Your device has 7 SMs. The maximum number of blocks per SM for 2.x devices is 8 blocks per SM.

  • You have provided an insufficient amount of code. If you provide code snippets, please specify the size of all inputs, the number of times each loop will be executed, and a description of the operations per function. Looking at the code, you may be doing too many loops per thread. Without knowing the order of magnitude of the outer loop, we can only guess.

  • Given that the launch is timing out, you should probably approach debugging as follows:

a. Add a line to the beginning of the kernel:

 if (blockIdx.x > 0) { return; } 

Run the exact code that is timing out in one of the previously mentioned profilers to estimate the duration of a single block. Using the launch information the profiler gives you (registers per thread, shared memory, ...), use the occupancy calculator in the profiler, or the occupancy calculator spreadsheet (xls), to determine the maximum number of blocks that you can run concurrently. For example, if the theoretical block occupancy is 3 blocks per SM and the number of SMs is 7, you can run 21 blocks at a time, which for your launch of 200 blocks is roughly 10 waves (200 / 21 ≈ 9.5). NOTE: this assumes equal work per thread. Then change the early-exit code to allow 1 wave (21 blocks). If this launch times out, you need to reduce the amount of work per thread. If it passes, calculate how many waves you have and estimate when you will hit the watchdog timeout (2 seconds on Windows, ? on Linux).

b. If you have too many waves, then you must reduce the launch configuration. Given that you index by gridDim.x and blockDim.x, you can do this by passing these dimensions in as parameters to your kernel. This will require you to change your indexing code minimally. You will also have to pass a blockIdx.x offset. Change your host code to launch several kernels back to back. Since there should be no conflict, you can launch them in multiple streams to take advantage of the overlap at the end of each wave (a rough sketch follows below).
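A rough sketch of point (b), with a trivial stand-in kernel and made-up sizes (the parameter names blockOffset / totalBlocks and the chunk size are illustrative, not from the question): the "virtual" grid size and a block offset are passed as kernel parameters, and the host issues several smaller launches back to back so that no single launch outlives the watchdog. The answer above also suggests multiple streams for overlap; this sketch keeps the launches sequential for simplicity.

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *a, int n, int blockOffset, int totalBlocks)
{
    // index as if this block belonged to one big launch of totalBlocks blocks
    int k = threadIdx.x + (blockIdx.x + blockOffset) * blockDim.x;
    int stride = blockDim.x * totalBlocks;
    while (k < n) {
        a[k] *= 2.0f;
        k += stride;
    }
}
""")
scale = mod.get_function("scale")

n = 1000000
a_gpu = gpuarray.to_gpu(np.ones(n, dtype=np.float32))

threads      = 128
total_blocks = 200   # the "virtual" grid of the original single launch
chunk        = 50    # blocks per real launch; tune so one launch stays well under the watchdog
for offset in range(0, total_blocks, chunk):
    blocks = min(chunk, total_blocks - offset)
    scale(a_gpu, np.int32(n), np.int32(offset), np.int32(total_blocks),
          block=(threads, 1, 1), grid=(blocks, 1))
    drv.Context.synchronize()   # let each chunk finish before queuing the next one

a = a_gpu.get()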

+3

A "launch timeout" indicates that the kernel ran for too long and was killed by the watchdog timer. This can happen on GPUs that are also used for display (for example, driving a graphical desktop), where the job of the watchdog timer is to prevent the desktop from locking up for more than a few seconds. As best I can recall, the watchdog time limit is around 5 seconds or so.

At any given time, the GPU can either run graphics or CUDA, so the watchdog timer is needed when a GUI is running to prevent the GUI from being blocked for an extended period of time, which would make the machine inoperable through the graphical interface.

If possible, avoid using this GPU for the desktop and/or other graphics (for example, do not run X if you are on Linux). If running without graphics is not an option, then to reduce the kernel execution time and avoid having the kernel killed by the watchdog timer, you will have to do less work per kernel launch, optimize the code so the kernel runs faster for the same amount of work, or deploy a faster GPU.
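As a quick check (a sketch, assuming PyCUDA exposes the KERNEL_EXEC_TIMEOUT device attribute, as it does in the versions from this era), you can ask the driver whether a run-time limit is active on the device; 1 means the watchdog can kill long-running kernels, 0 means the GPU is not driving a display and has no such limit:

import pycuda.autoinit
import pycuda.driver as drv

dev = pycuda.autoinit.device
print(dev.get_attribute(drv.device_attribute.KERNEL_EXEC_TIMEOUT))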

+3

Source: https://habr.com/ru/post/1432275/

