Why does cudaMalloc give me an error when I know that there is enough memory space?

I have a Tesla C2070, which should have 5636554752 bytes of memory.

However, this gives me an error:

    int *buf_d = NULL;
    err = cudaMalloc((void **)&buf_d, 1000000000 * sizeof(int));
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_ERROR;
    }

How is this possible? Is this related to the maximum memory pitch? Here are the specifications for the GPU:

    Device 0: "Tesla C2070"
    CUDA Driver Version:                           3.20
    CUDA Runtime Version:                          3.20
    CUDA Capability Major/Minor version number:   2.0
    Total amount of global memory:                 5636554752 bytes
    Multiprocessors x Cores/MP = Cores:            14 (MP) x 32 (Cores/MP) = 448 (Cores)
    Total amount of constant memory:               65536 bytes
    Total amount of shared memory per block:       49152 bytes
    Total number of registers available per block: 32768
    Warp size:                                     32
    Maximum number of threads per block:           1024
    Maximum sizes of each dimension of a block:    1024 x 1024 x 64
    Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
    Maximum memory pitch:                          2147483647 bytes

As for the machine I am working on, it has 24 Intel® Xeon® X565 processors and runs a Linux 5.4 distribution (Maverick).

Any ideas? Thanks!

1 answer

The main problem with the question title is that you don't actually know that you have enough memory; you are assuming you do. The runtime API includes the cudaMemGetInfo function, which returns the amount of free memory on the device. When a context is established on the device, the driver must reserve space for device code, local memory for each thread, FIFO buffers for printf support, a stack for each thread, and a heap for in-kernel malloc/new calls (see this answer for more details). All of this can consume rather a lot of memory, leaving you with much less than the maximum available memory, after ECC reservations, that you are assuming is available to your code. The API also includes cudaDeviceGetLimit, which you can use to query the amounts of memory that device runtime support is consuming. There is also a companion call, cudaDeviceSetLimit, which lets you change the amount of memory each component of runtime support will reserve.
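If it helps, here is a minimal sketch (not compiled, and assuming the CUDA runtime headers are available and the device supports these limits) of how those calls might be used to inspect free memory and the runtime reservations:

    // Sketch only: query free/total memory and the device runtime reservations
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);
        printf("free: %zu bytes, total: %zu bytes\n", free_b, total_b);

        size_t printf_fifo = 0, stack_per_thread = 0, malloc_heap = 0;
        cudaDeviceGetLimit(&printf_fifo, cudaLimitPrintfFifoSize);
        cudaDeviceGetLimit(&stack_per_thread, cudaLimitStackSize);
        cudaDeviceGetLimit(&malloc_heap, cudaLimitMallocHeapSize);
        printf("printf FIFO: %zu, stack per thread: %zu, malloc heap: %zu\n",
               printf_fifo, stack_per_thread, malloc_heap);

        // Example: shrink the in-kernel malloc heap reservation to 8 MB
        cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 << 20);
        return 0;
    }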

Even after you have adjusted the runtime memory footprint to your liking and obtained the actual free memory value from the driver, there is still the question of page granularity and fragmentation to deal with. It is rare to be able to allocate every byte of what the API reports as free. I usually do something like the following when the goal is to try to allocate every available byte on the card:

    const size_t Mb = 1 << 20; // Assuming a 1Mb page size here

    size_t available, total;
    cudaMemGetInfo(&available, &total);

    int *buf_d = 0;
    size_t nwords = total / sizeof(int);
    size_t words_per_Mb = Mb / sizeof(int);

    while (cudaMalloc((void **)&buf_d, nwords * sizeof(int)) == cudaErrorMemoryAllocation) {
        nwords -= words_per_Mb;
        if (nwords < words_per_Mb) {
            // signal no free memory
            break;
        }
    }
    // leaves int buf_d[nwords] on the device or signals no free memory

(Note: this has never been near a compiler, so use at your own risk; it is only safe on CUDA 3 or later.) It also implicitly assumes that none of the obvious sources of problems with large allocations apply (32-bit host operating system, Windows WDDM platform without TCC mode enabled, older known driver issues).
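Applied back to the code in the question, a hedged pre-check along these lines (again a sketch, not compiled) would at least show how the requested size compares with what the driver actually reports as free before the cudaMalloc call is attempted:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // The original request: 1000000000 ints, roughly 3.7 GiB with a 4-byte int
        size_t request = 1000000000ull * sizeof(int);

        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);
        printf("requesting %zu bytes; driver reports %zu free of %zu total\n",
               request, free_b, total_b);

        int *buf_d = NULL;
        cudaError_t err = cudaMalloc((void **)&buf_d, request);
        if (err != cudaSuccess) {
            printf("CUDA error: %s\n", cudaGetErrorString(err));
            return 1;
        }
        cudaFree(buf_d);
        return 0;
    }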
