CUDA: multiple block dimensions or just one?

I need to take the square root of each element of a matrix (which is basically just a vector of float values once it is in memory) using CUDA.

The matrix dimensions are not known a priori and can vary in the range [2, 20,000].

I was wondering: I could use (as Jonathan suggested here) a single block dimension, computing the global thread index as follows:

int thread_id = blockDim.x * blockIdx.x + threadIdx.x;

and then check that thread_id is below rows * columns. It's pretty simple and straightforward.
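To make that concrete, here is a minimal sketch of the kernel I have in mind (the kernel name, block size, and the d_data pointer are just illustrative):

    __global__ void sqrt_elements(float *data, int n)
    {
        int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
        if (thread_id < n)                        // n = rows * columns
            data[thread_id] = sqrtf(data[thread_id]);
    }

    // host side: one 1D grid covering all rows * columns elements
    int n = rows * columns;
    int threads_per_block = 256;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    sqrt_elements<<<blocks, threads_per_block>>>(d_data, n);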

But is there any performance-related reason why I should use two (or even three) block/grid dimensions for such a computation (bearing in mind that what I have is, after all, a matrix) instead of just one?

I am thinking of coalescing issues, for example, having all threads read values sequentially.

1 answer

The dimensions exist only for convenience; internally everything is linear, so in terms of efficiency there is no advantage either way. Avoiding a contrived index calculation, as you have shown above, might make it marginally faster, but there would be no difference in how the threads coalesce.
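To illustrate (a sketch, assuming row-major storage and names of my own choosing), a 2D launch of the same operation touches exactly the same addresses per warp as the 1D version, as long as threadIdx.x runs along the columns:

    __global__ void sqrt_elements_2d(float *data, int rows, int columns)
    {
        int col = blockDim.x * blockIdx.x + threadIdx.x;   // fastest-varying index
        int row = blockDim.y * blockIdx.y + threadIdx.y;
        if (row < rows && col < columns)
            data[row * columns + col] = sqrtf(data[row * columns + col]);
    }

Within a warp, consecutive threadIdx.x values still read consecutive floats, so the memory accesses coalesce exactly as in the 1D kernel; the only difference is the extra index arithmetic.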

