Do I need a cudaMemcpy with CUDA pinned memory?

Pinned memory is supposed to increase transfer speeds from host to device (according to the API reference). However, I found that I do not need to call cuMemcpyHtoD for the kernel to access the values, or cuMemcpyDtoH for the host to read the values back. I did not expect this to work, but it does:

#include <cstdio>

__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue + 10;
}

void test()
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocDefault);

    //set memory values
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;

    //call kernel
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<<numBlocks, threadsPerBlock>>>(pinnedHostPtr);

    //read output (note: no cudaMemcpy and no explicit synchronization)
    printf("Data after kernel execution: ");
    for (size_t i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);
    printf("\n");

    cudaFreeHost(pinnedHostPtr);
}

Output:

Data after kernel execution: 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000
Thread id: 0, memory content: 0.000000
Thread id: 1, memory content: 1.000000
Thread id: 2, memory content: 2.000000
Thread id: 3, memory content: 3.000000
Thread id: 4, memory content: 4.000000
Thread id: 5, memory content: 5.000000
Thread id: 6, memory content: 6.000000
Thread id: 7, memory content: 7.000000

My questions:

  • Is this zero-copy? I thought only pinned memory that is explicitly mapped was zero-copy.
  • If this is zero-copy, why is there an explicit way to map it to the device (cudaHostAlloc with the cudaHostAllocMapped flag)?

I am using CUDA Toolkit 5.5, a Quadro 4000 with the driver in TCC mode, and the compile options sm_20,compute_20.
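
For reference, this is the explicit-copy workflow I assumed was mandatory (a sketch; testExplicitCopy is just an illustrative name, and it reuses the testPinnedMemory kernel above):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void testPinnedMemory(double * mem); // the kernel shown above

void testExplicitCopy()
{
    const size_t THREADS = 8;
    double hostData[THREADS];
    for (size_t i = 0; i < THREADS; ++i)
        hostData[i] = (double)i;

    // allocate device memory and copy the input to it explicitly
    double * devPtr = NULL;
    cudaMalloc((void **)&devPtr, THREADS * sizeof(double));
    cudaMemcpy(devPtr, hostData, THREADS * sizeof(double), cudaMemcpyHostToDevice);

    testPinnedMemory<<<1, THREADS>>>(devPtr);

    // copy the results back (this also waits for the kernel to finish)
    cudaMemcpy(hostData, devPtr, THREADS * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(devPtr);

    printf("Data after kernel execution: ");
    for (size_t i = 0; i < THREADS; ++i)
        printf("%f ", hostData[i]);
    printf("\n");
}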

+4
2 answers

Congratulations! You have run into a compute capability 2.x + TCC + 64-bit OS feature of modern CUDA :)

Read the rest to find out what is going on!

First, a short theory recap, the way the CUDA documentation teaches it:

  • Pinned (page-locked) memory speeds up transfers between the host and the GPU. It is still host (CPU) memory, though; a kernel normally cannot dereference it.

  • Mapped (zero-copy) memory, allocated with cudaHostAllocMapped, is pinned memory that is additionally mapped into the device address space, so a kernel can read and write it directly over the bus (a sketch of this explicit path follows this list).
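
To make the difference concrete, here is a minimal sketch of the explicit zero-copy path that is required on systems without UVA (it reuses the testPinnedMemory kernel from the question):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void testPinnedMemory(double * mem); // kernel from the question

int main()
{
    // must be set before the CUDA context is created
    cudaSetDeviceFlags(cudaDeviceMapHost);

    double * hostPtr = NULL;
    cudaHostAlloc((void **)&hostPtr, 8 * sizeof(double), cudaHostAllocMapped);
    for (int i = 0; i < 8; ++i)
        hostPtr[i] = i;

    // without UVA, the device-side alias of the mapped allocation must be
    // looked up explicitly; it may differ from hostPtr
    double * devPtr = NULL;
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);

    testPinnedMemory<<<1, 8>>>(devPtr);
    cudaDeviceSynchronize(); // finish the kernel and flush its printf output

    cudaFreeHost(hostPtr);
    return 0;
}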

So far so good, but then why does the code above work without the cudaHostAllocMapped flag?

The reason is a feature introduced in CUDA 4.0 (as far as I remember):

  • On supported platforms (Windows and Linux), host memory and device memory live in a single unified virtual address space (UVA).

UVA is only in effect for 64-bit processes and devices of compute capability 2.0 or higher; on Windows the driver must additionally run in TCC mode. When it is active, all pinned host allocations are automatically portable and mapped, and a single pointer value refers to the same memory on the host and on the device. The kernel can therefore dereference your host pointer directly; no explicit copy is needed.

Conclusion: you are using a compute capability 2.0+ card (check the list: https://developer.nvidia.com/cuda-gpus), a 64-bit OS, and on Windows the TCC driver mode, so UVA (Unified Virtual Addressing) kicks in automatically. The result: all of your pinned allocations are also mapped.

The CUDA documentation refers to this as "zero-copy".

+8

Mapped memory is a kind of pinned memory. It is created when you allocate with the cudaHostAllocMapped flag. However, since you specified cudaHostAllocDefault, your memory should merely have been "pinned". But because you are running a TCC driver on a 64-bit OS, every pinned allocation is automatically also "mapped" (zero-copy).

This is an effect of UVA (Unified Virtual Addressing). It is exactly what makes your example work.

Note that if you need mapped behavior on a platform where UVA is not available (for example, a 32-bit OS), you still have to request it explicitly with cudaHostAllocMapped.

From the documentation (the Unified Addressing section of the CUDA Runtime API reference):

Automatic Mapping of Host Allocated Host Memory

All host memory allocated through all devices using cudaMallocHost() and cudaHostAlloc() is always directly accessible from all devices that support unified addressing. This is the case regardless of whether or not the flags cudaHostAllocPortable and cudaHostAllocMapped are specified. The pointer value through which allocated host memory may be accessed in kernels on all devices that support unified addressing is the same as the pointer value through which that memory is accessed on the host, so it is not necessary to call cudaHostGetDevicePointer() to obtain the device pointer for these allocations.
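
A small sketch that illustrates the quoted guarantee: under UVA, cudaHostGetDevicePointer() hands back the host pointer itself, even for a cudaHostAllocDefault allocation:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    double * hostPtr = NULL;
    cudaHostAlloc((void **)&hostPtr, 8 * sizeof(double), cudaHostAllocDefault);

    void * devPtr = NULL;
    cudaError_t err = cudaHostGetDevicePointer(&devPtr, hostPtr, 0);

    printf("host pointer:   %p\n", (void *)hostPtr);
    printf("device pointer: %p (%s)\n", devPtr, cudaGetErrorString(err));
    // on a UVA system the two addresses should match

    cudaFreeHost(hostPtr);
    return 0;
}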

+5

Source: https://habr.com/ru/post/1525670/

