I have a two-dimensional host array with 10 rows and 96 columns. I copy this array into the CUDA device's global memory linearly, i.e. row1, row2, row3 ... row10.
The array is of type float. In my kernel, each thread accesses a single float value from the device’s global memory.
The BLOCK_SIZE I use is 96.
The GRID_DIM I use is 10.
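For reference, here is a minimal sketch of the setup I am describing (the kernel and variable names `readOneFloat`, `d_in`, `d_out` are just placeholders I made up for this post):

```cuda
#include <cuda_runtime.h>

#define ROWS 10   // GRID_DIM
#define COLS 96   // BLOCK_SIZE

// Each thread reads exactly one float from global memory.
// blockIdx.x selects the row, threadIdx.x selects the column,
// so consecutive threads in a block touch consecutive addresses.
__global__ void readOneFloat(const float *d_in, float *d_out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // linear index into the row-major array
    d_out[idx] = d_in[idx];
}

int main()
{
    float h_in[ROWS * COLS];
    for (int i = 0; i < ROWS * COLS; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  ROWS * COLS * sizeof(float));
    cudaMalloc(&d_out, ROWS * COLS * sizeof(float));
    cudaMemcpy(d_in, h_in, ROWS * COLS * sizeof(float), cudaMemcpyHostToDevice);

    readOneFloat<<<ROWS, COLS>>>(d_in, d_out);   // GRID_DIM = 10, BLOCK_SIZE = 96
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```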
From what I understood in the "CUDA C Programming Guide" about coalesced accesses, the pattern I am using is correct: consecutive threads access consecutive memory locations. But there is a paragraph about 128-byte memory segments/alignment which I do not understand.
Q1) 128-byte memory alignment: does this mean that each thread in a warp should access 4 bytes, with the warp covering addresses from (for example) 0x00 to 0x80?
Q2) So, in the scenario above, will my accesses be coalesced or not?
My understanding: each thread should make one memory access of 4 bytes, with the whole warp falling within an address range such as 0x00 to 0x80. If a thread in the warp accesses a location outside that range, the access is uncoalesced.
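To make my reasoning concrete, here is the arithmetic I have in mind, as a sketch (assuming a warp size of 32 and that the first thread of the warp starts at element 0 of the array):

```cuda
// Host-side arithmetic only -- no CUDA API calls needed here.
#include <cstdio>

int main()
{
    const int WARP_SIZE  = 32;                   // threads per warp
    const int bytesPerLd = (int)sizeof(float);   // 4 bytes per thread

    // One warp therefore spans WARP_SIZE * bytesPerLd = 128 bytes (0x80).
    printf("warp span = %d bytes (0x%X)\n", WARP_SIZE * bytesPerLd, WARP_SIZE * bytesPerLd);

    // Byte range touched by each lane, relative to the warp's base address:
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        printf("lane %2d -> bytes 0x%02X..0x%02X\n",
               lane, lane * bytesPerLd, lane * bytesPerLd + bytesPerLd - 1);
    return 0;
}
```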