I need to take the square root of each element of a matrix (which is basically a vector of float values once in memory) using CUDA.
The matrix dimensions are not known a priori and can vary in the range [2, 20000].
I was wondering: I could use (as Jonathan suggested here) a single block dimension, as follows:
int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
and check that thread_id is below the rows * columns ... it's pretty simple and straightforward.
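To make my idea concrete, here is a minimal sketch of the 1D approach I have in mind (the kernel name, d_data, and the block size of 256 are just placeholders I made up):

// Kernel: one thread per element, 1D indexing over the flattened matrix
__global__ void sqrt_kernel(float *data, int n)
{
    int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    if (thread_id < n)                       // guard threads past the last element
        data[thread_id] = sqrtf(data[thread_id]);
}

// Launch: enough 1D blocks to cover rows * cols elements
// int n = rows * cols;
// int threads = 256;
// int blocks = (n + threads - 1) / threads;
// sqrt_kernel<<<blocks, threads>>>(d_data, n);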
But is there any specific performance reason why I should use two (or even three) block dimensions to perform such a calculation (bearing in mind that what I have is, after all, a matrix) instead of one?
I am thinking of coalescing issues, for example, making sure all threads read values sequentially.