I need to take the square root of each element of a matrix (which is basically a vector of float values once in memory) using CUDA.
The matrix dimensions are not known a priori and can vary in the range [2, 20000].
I was wondering: I could use (as Jonathan suggested here) a single block dimension, as follows:
int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
and check that thread_id is below the rows * columns ... it's pretty simple and straightforward.
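To make my idea concrete, here is a minimal sketch of the 1D approach I have in mind (the kernel name, d_data, and the block size of 256 are just placeholders I made up):

// Kernel: one thread per element, 1D indexing over the flattened matrix
__global__ void sqrt_kernel(float *data, int n)
{
    int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    if (thread_id < n)                       // guard threads past the last element
        data[thread_id] = sqrtf(data[thread_id]);
}

// Launch: enough 1D blocks to cover rows * cols elements
// int n = rows * cols;
// int threads = 256;
// int blocks = (n + threads - 1) / threads;
// sqrt_kernel<<<blocks, threads>>>(d_data, n);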
But is there any specific performance reason why I should use two (or even three) block dimensions to perform such a calculation (bearing in mind that what I have is, after all, a matrix) instead of one?
I am thinking of coalescing issues, for example, making sure all threads read values sequentially.