Qt and CUDA Visual Profiler error in memory transfer size

Question

I prepared a .pro file for using Qt and CUDA on a Linux machine (64-bit). When I run the application in the CUDA profiler, the application runs 12 times, but before the results are shown I get the following error:

Error in profiler data file '/home/myusername/development/qtspace/bin/temp_compute_profiler_0_0.csv' in line 6 for transfer size by column.

The main.cpp file is as simple as

    #include <QtCore/QCoreApplication>

    extern "C" void runCudaPart();

    int main(int argc, char *argv[])
    {
        QCoreApplication a(argc, argv);
        runCudaPart();
        return 0;
    }

The fact is that if I remove the "QCoreApplication a(argc, argv);" line, the CUDA Visual Profiler works as expected and shows all the results.

I checked that cuda_profile.log is generated from the command line if I export the environment variable CUDA_PROFILE=1. A comma-separated file is also generated if I export the COMPUTE_PROFILE_CSV=1 variable, but the CUDA Visual Profiler crashes when I try to import that file.
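When checking this from the command line, it can help to confirm which profiler variables the process actually sees. The following is a minimal sketch, assuming only the two variable names mentioned above; `describeEnv` and `profilerEnvSummary` are hypothetical helpers, not part of CUDA or Qt.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: reports how one environment variable looks to this process.
std::string describeEnv(const char *name)
{
    const char *value = std::getenv(name);
    if (value == nullptr)
        return std::string(name) + " is not set";
    return std::string(name) + "=" + value;
}

// Summarises the two profiler variables named in the question, one per line.
std::string profilerEnvSummary()
{
    return describeEnv("CUDA_PROFILE") + "\n" + describeEnv("COMPUTE_PROFILE_CSV");
}
```

Printing `profilerEnvSummary()` at the start of `main` would show whether the variables were exported in the shell that launched the program.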

Any hints about this problem? It seems to be related to the CUDA Visual Profiler application itself, not to the code.

If you're wondering why I made main.cpp so simple, with Qt included but not actually used :P, it is because I would like to extend the structure in the future to add a graphical interface.

Details of the CUDA, GPU, OS, Qt and compiler versions:

    Device "GeForce GTX 480"
    CUDA Driver Version: 3.20
    CUDA Runtime Version: 3.20
    CUDA Capability Major/Minor version number: 2.0
    OS: ubuntu 10.04 LTS
    QT_VERSION: 263682
    QT_VERSION_STR: 4.6.2
    gcc version 4.4.3
    nvcc compilation tool, release 3.2, V0.2.122

I noticed that the problem is in the QCoreApplication constructor. It does something with the arguments. If I change the line as follows:

 QCoreApplication a();

the Visual Profiler works as expected. It is difficult to know what is happening, and whether this change will be a problem in the future. Any clues?
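One thing worth checking here: in C++, a line of the form `QCoreApplication a();` does not construct an object. It declares a function named `a` that takes no arguments and returns a `QCoreApplication`, so with that line no application object exists at all, which would make it equivalent to removing the line. The sketch below illustrates this with a stand-in type; `FakeApp` and both function names are hypothetical, used only so the example needs no Qt.

```cpp
#include <type_traits>

// Stand-in for QCoreApplication (an assumption for illustration): it has no
// default constructor and counts how many objects were actually built.
struct FakeApp {
    static int constructed;
    FakeApp(int &argc, char **argv) { (void)argc; (void)argv; ++constructed; }
};
int FakeApp::constructed = 0;

// Returns how many FakeApp objects the empty-parentheses form constructs.
int countConstructedWithEmptyParens()
{
    const int before = FakeApp::constructed;
    FakeApp a();   // declares a function `a` returning FakeApp; constructs nothing
    static_assert(std::is_function<decltype(a)>::value,
                  "`a` is a function declaration, not an object");
    return FakeApp::constructed - before;
}

// Returns how many FakeApp objects the argument-passing form constructs.
int countConstructedWithArgs(int &argc, char **argv)
{
    const int before = FakeApp::constructed;
    FakeApp a(argc, argv);   // a real object: the constructor runs
    (void)a;
    return FakeApp::constructed - before;
}
```

That the empty-parentheses line compiles even though the type has no default constructor is itself a hint that it is being parsed as a declaration. Compilers may warn about this form.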

Regarding the QCoreApplication constructor, the example also works if I call the CUDA part before constructing QCoreApplication.

    // this way the example works.
    runCudaPart();
    QCoreApplication a(argc, argv);

Thanks in advance.

+6

profiling qt cuda

pQB Apr 26 '11 at 16:47

3 answers

@pQB Hi, I'm Ramesh from NVIDIA. We could not reproduce this problem locally. This error occurs when the value in the named column is empty or invalid. In your case (an error in the profiler data file "/home/myusername/development/qtspace/bin/temp_compute_profiler_0_0.csv" at line 6 for the memory transfer size column), the value of the memory transfer size column is empty or invalid on line 6 of the CSV file.

Can you send the 'temp_compute_profiler_0_0.csv' file, if it is present in your working directory, together with the CSV generated by the command-line profiler? If that is not possible, can you check what value you get for this column (memory transfer size) on line 6?

Are you running your application with the default settings in Visual Profiler? Can you try disabling the "memory transfer size" option? To disable it, open the "Session -> Session Settings ..." menu, select the "Other Settings" tab in the session settings dialog box, and uncheck the "memory transfer size" box.
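The check described above (is the value in a given column empty or invalid on a given line?) can also be done mechanically. The following is a rough sketch under stated assumptions, not the profiler's own parser: it assumes lines starting with `#` are comments, the first remaining line is a comma-separated header, and fields contain no quoted commas. The function names are hypothetical, and the column name is supplied by the caller because the exact header text in the profiler's CSV is not shown in this thread.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Splits one CSV line on commas (no quoting support; an assumption of this sketch).
std::vector<std::string> splitCsvLine(const std::string &line)
{
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, ','))
        fields.push_back(field);
    if (!line.empty() && line.back() == ',')
        fields.push_back("");   // keep a trailing empty field
    return fields;
}

// True if the whole string parses as a number.
bool isNumber(const std::string &text)
{
    if (text.empty())
        return false;
    char *end = nullptr;
    std::strtod(text.c_str(), &end);
    return end == text.c_str() + text.size();
}

// Returns the 1-based line numbers whose value in `columnName` is missing,
// empty, or not numeric. Returns nothing if the column is not in the header.
std::vector<int> findBadCells(const std::string &csvText, const std::string &columnName)
{
    std::vector<int> badLines;
    std::istringstream in(csvText);
    std::string line;
    int lineNumber = 0;
    int column = -1;          // index of columnName once the header is seen
    bool headerSeen = false;
    while (std::getline(in, line)) {
        ++lineNumber;
        if (!line.empty() && line.back() == '\r')
            line.pop_back();  // tolerate Windows line endings
        if (line.empty() || line[0] == '#')
            continue;         // skip blank lines and comment lines
        const std::vector<std::string> fields = splitCsvLine(line);
        if (!headerSeen) {
            headerSeen = true;
            for (std::size_t i = 0; i < fields.size(); ++i)
                if (fields[i] == columnName)
                    column = static_cast<int>(i);
            continue;
        }
        if (column < 0)
            continue;         // column not present: nothing to check
        if (static_cast<std::size_t>(column) >= fields.size() || !isNumber(fields[column]))
            badLines.push_back(lineNumber);
    }
    return badLines;
}
```

Reading the CSV file into a string and passing it to `findBadCells` with the header name of the transfer size column would list the lines the importer is likely to reject, which can then be compared with the line number in the error message.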

+2

Ramsh Jul 01 '11 at 13:45

I could not reproduce this problem. I checked with:
- Ubuntu 10.10 (64-bit) and FC6 (32-bit)
- Qt 4.5 and Qt 4.7
- CUDA 4.0 components

0

Ramsh Jul 05 '11 at 12:23

talonmies · Accepted Answer · 2011-06-29T14:23:31+0000

I cannot reproduce this using CUDA 3.2 and Qt 4 on a 64-bit Ubuntu 10.04 LTS system. I took this main:

    #include <QtCore/QCoreApplication>

    extern float cudamain();

    int main(int argc, char *argv[])
    {
        QCoreApplication a(argc, argv);
        float gflops = cudamain();
        return 0;
    }

and a cudamain() containing this:

    #include <assert.h>

    #define blocksize 16
    #define HM (4096)
    #define WM (4096)
    #define WN (4096)
    #define HN WM
    #define WP WN
    #define HP HM
    #define PTH WM
    #define PTW HM

    __global__ void nonsquare(float*M, float*N, float*P, int uWM,int uWN)
    {
        __shared__ float MS[blocksize][blocksize];
        __shared__ float NS[blocksize][blocksize];

        int tx=threadIdx.x, ty=threadIdx.y, bx=blockIdx.x, by=blockIdx.y;
        int rowM=ty+by*blocksize;
        int colN=tx+bx*blocksize;
        float Pvalue=0;

        for(int m=0; m<uWM; m+=blocksize){
            MS[ty][tx]=M[rowM*uWM+(m+tx)] ;
            NS[ty][tx]=M[colN + uWN*(m+ty)];
            __syncthreads();

            for(int k=0;k<blocksize;k++)
                Pvalue+=MS[ty][k]*NS[k][tx];
            __syncthreads();
        }
        P[rowM*WP+colN]=Pvalue;
    }

    inline void gpuerrorchk(cudaError_t state)
    {
        assert(state == cudaSuccess);
    }

    float cudamain(){
        cudaEvent_t evstart, evstop;
        cudaEventCreate(&evstart);
        cudaEventCreate(&evstop);

        float*M=(float*)malloc(sizeof(float)*HM*WM);
        float*N=(float*)malloc(sizeof(float)*HN*WN);

        for(int i=0;i<WM*HM;i++)
            M[i]=(float)i;
        for(int i=0;i<WN*HN;i++)
            N[i]=(float)i;

        float*P=(float*)malloc(sizeof(float)*HP*WP);

        float *Md,*Nd,*Pd;
        gpuerrorchk( cudaMalloc((void**)&Md,HM*WM*sizeof(float)) );
        gpuerrorchk( cudaMalloc((void**)&Nd,HN*WN*sizeof(float)) );
        gpuerrorchk( cudaMalloc((void**)&Pd,HP*WP*sizeof(float)) );

        gpuerrorchk( cudaMemcpy(Md,M,HM*WM*sizeof(float),cudaMemcpyHostToDevice) );
        gpuerrorchk( cudaMemcpy(Nd,N,HN*WN*sizeof(float),cudaMemcpyHostToDevice) );

        dim3 dimBlock(blocksize,blocksize);//(tile_width , tile_width);
        dim3 dimGrid(WN/dimBlock.x,HM/dimBlock.y);//(width/tile_width , width/tile_witdh);

        gpuerrorchk( cudaEventRecord(evstart,0) );

        nonsquare<<<dimGrid,dimBlock>>>(Md,Nd,Pd,WM, WN);
        gpuerrorchk( cudaPeekAtLastError() );

        gpuerrorchk( cudaEventRecord(evstop,0) );
        gpuerrorchk( cudaEventSynchronize(evstop) );
        float time;
        cudaEventElapsedTime(&time,evstart,evstop);

        gpuerrorchk( cudaMemcpy(P,Pd,WP*HP*sizeof(float),cudaMemcpyDeviceToHost) );

        cudaFree(Md);
        cudaFree(Nd);
        cudaFree(Pd);

        float gflops=(2.e-6*WM*WM*WM)/(time);

        cudaThreadExit();

        return gflops;
    }

(Do not pay attention to the actual code. Apart from performing memory transfers and launching a kernel, it is otherwise nonsense.)

Compiling the code as follows:

    cuda:~$ nvcc -arch=sm_20 -c -o cudamain.o cudamain.cu
    cuda:~$ g++ -o qtprob -I/usr/include/qt4 qtprob.cc cudamain.o -L $CUDA_INSTALL_PATH/lib64 -lQtCore -lcuda -lcudart
    cuda:~$ ldd qtprob
        linux-vdso.so.1 => (0x00007fff242c8000)
        libQtCore.so.4 => /opt/cuda-3.2/computeprof/bin/libQtCore.so.4 (0x00007fbe62344000)
        libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007fbe61a3d000)
        libcudart.so.3 => /opt/cuda-3.2/lib64/libcudart.so.3 (0x00007fbe617ef000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007fbe614db000)
        libm.so.6 => /lib/libm.so.6 (0x00007fbe61258000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fbe61040000)
        libc.so.6 => /lib/libc.so.6 (0x00007fbe60cbd000)
        libz.so.1 => /lib/libz.so.1 (0x00007fbe60aa6000)
        libgthread-2.0.so.0 => /usr/lib/libgthread-2.0.so.0 (0x00007fbe608a0000)
        libglib-2.0.so.0 => /lib/libglib-2.0.so.0 (0x00007fbe605c2000)
        librt.so.1 => /lib/librt.so.1 (0x00007fbe603ba000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x00007fbe6019c000)
        libdl.so.2 => /lib/libdl.so.2 (0x00007fbe5ff98000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fbe626c0000)
        libpcre.so.3 => /lib/libpcre.so.3 (0x00007fbe5fd69000)

produces an executable that profiles without error, as many times as I care to run it, using the CUDA 3.2 profiler.

All I can suggest is that you try my repro case and see whether it works. If it fails, you probably have either a broken CUDA installation or a broken Qt installation. If it does not fail (and I suspect it won't), then the problem lies either in how you are building the Qt project or in the actual CUDA code you are running yourself.
