CUDA 5.0 Memory Alignment and Coalesced Access

I have a two-dimensional host array with 10 rows and 96 columns. I load this array into the global memory of the CUDA device linearly, i.e. row 1, row 2, row 3, ... row 10.

The array is of type float. In my kernel, each thread accesses a single float value from the device’s global memory.

 The BLOCK_SIZE I use is 96
 The GRID_DIM I use is 10

From what I understand of the "CUDA C Programming Guide", this access pattern is correct for coalescing: threads access consecutive memory locations. But there is a paragraph about 128-byte memory alignment which I do not understand.

Q1) 128-byte memory alignment: does this mean that every thread in a warp should access 4 bytes within a 128-byte range, e.g. starting from address 0x00 up to 0x80?

Q2) So, with this pattern, will my accesses be coalesced or not?

My understanding: each thread should make one 4-byte memory access, falling within an address range such as 0x00 to 0x80. If a thread of the warp accesses a location outside that range, the access is uncoalesced.

1 answer

Loads from global memory are usually done in 128-byte chunks, aligned on 128-byte boundaries. Coalesced memory access means that you keep all accesses from your warp within one 128-byte chunk. (On older cards, memory also had to be accessed in order of thread id, but newer cards no longer have this requirement.)

If the 32 threads of a warp each read a 4-byte float, the warp reads 128 bytes in total. If those reads are contiguous and start at an address of the form a[32*i], i.e. on a 128-byte boundary, the whole warp is served by a single memory transaction.

If the accesses are not aligned like this, they straddle two chunks and the warp requires an additional transaction.

Since your rows are 96 floats long (96 × 4 = 384 bytes = 3 × 128 bytes), every row starts on a 128-byte boundary, so when thread i of the first warp accesses a[i], the access is coalesced; the same holds for a[i+32] and a[i+64], accessed by the other two warps of the block.

So for Q1: yes, 128-byte alignment means that the 32 four-byte accesses of a warp should all fall within one 128-byte chunk that starts on a 128-byte boundary.

For Q2: yes, your accesses are coalesced, because each warp reads a[32*x+i], where i is the thread's lane index within the warp and x is constant across the warp, so every warp stays inside one aligned 128-byte chunk.

Note also that section 5.3.2.1.1 of the programming guide states that memory returned by cudaMalloc is aligned to at least 256 bytes, so the base address of your array is already suitably aligned.


Source: https://habr.com/ru/post/1671309/

