I have a two-dimensional host array with 10 rows and 96 columns. I copy this array into the CUDA device's global memory linearly, i.e. row1, row2, row3 ... row10.
The array is of type float. In my kernel, each thread accesses a single float value from the device’s global memory.
The BLOCK_SIZE I use is 96.
The GRID_DIM I use is 10.
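For reference, here is a minimal sketch of the setup I am describing (the kernel and variable names `readOneFloat`, `d_in`, `d_out` are just placeholders I made up for this post):

```cuda
#include <cuda_runtime.h>

#define ROWS 10   // GRID_DIM
#define COLS 96   // BLOCK_SIZE

// Each thread reads exactly one float from global memory.
// blockIdx.x selects the row, threadIdx.x selects the column,
// so consecutive threads in a block touch consecutive addresses.
__global__ void readOneFloat(const float *d_in, float *d_out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // linear index into the row-major array
    d_out[idx] = d_in[idx];
}

int main()
{
    float h_in[ROWS * COLS];
    for (int i = 0; i < ROWS * COLS; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  ROWS * COLS * sizeof(float));
    cudaMalloc(&d_out, ROWS * COLS * sizeof(float));
    cudaMemcpy(d_in, h_in, ROWS * COLS * sizeof(float), cudaMemcpyHostToDevice);

    readOneFloat<<<ROWS, COLS>>>(d_in, d_out);   // GRID_DIM = 10, BLOCK_SIZE = 96
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```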
From what I understood in the "CUDA C Programming Guide" about coalesced accesses, the pattern I am using is correct: consecutive threads access consecutive memory locations. But there is a paragraph about 128-byte memory segments/alignment which I do not understand.
Q1) 128-byte memory alignment: does this mean that each thread in a warp should access 4 bytes, with the warp covering addresses from (for example) 0x00 to 0x80?
Q2) So, in the scenario above, will my accesses be coalesced or not?
My understanding: each thread should make one memory access of 4 bytes, with the whole warp falling within an address range such as 0x00 to 0x80. If a thread in the warp accesses a location outside that range, the access is uncoalesced.
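To make my reasoning concrete, here is the arithmetic I have in mind, as a sketch (assuming a warp size of 32 and that the first thread of the warp starts at element 0 of the array):

```cuda
// Host-side arithmetic only -- no CUDA API calls needed here.
#include <cstdio>

int main()
{
    const int WARP_SIZE  = 32;                   // threads per warp
    const int bytesPerLd = (int)sizeof(float);   // 4 bytes per thread

    // One warp therefore spans WARP_SIZE * bytesPerLd = 128 bytes (0x80).
    printf("warp span = %d bytes (0x%X)\n", WARP_SIZE * bytesPerLd, WARP_SIZE * bytesPerLd);

    // Byte range touched by each lane, relative to the warp's base address:
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        printf("lane %2d -> bytes 0x%02X..0x%02X\n",
               lane, lane * bytesPerLd, lane * bytesPerLd + bytesPerLd - 1);
    return 0;
}
```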