I have implemented a Matrix data type in C++ using a single 1D array and wrapping it with row/column indexing. Now I want the ability to create square/blocked submatrices of it, and I want to do this in place, without duplicating the data in memory.
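To make this concrete, here is roughly what I mean (a simplified sketch; SubMatrixView and the member names are just for illustration, not my actual code):

    #include <cstddef>
    #include <vector>

    // The data lives in one contiguous row-major 1D array.
    struct Matrix {
        std::size_t rows, cols;
        std::vector<float> data;   // rows * cols elements, row-major

        Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

        float& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    };

    // A block submatrix is just an offset into the parent's data plus the
    // parent's row pitch, so its rows are NOT contiguous in memory.
    struct SubMatrixView {
        float*      origin;  // &parent(row0, col0)
        std::size_t rows, cols;
        std::size_t pitch;   // parent.cols: distance between consecutive rows
    };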
The problem is that I want to transfer some of these submatrices to GPU memory and process them there in parallel; this is useful, for example, for blocked matrix multiplication. Since these submatrices are not contiguous in main memory, copying one to the device's memory as a single block does not seem possible without creating a separate copy first. I would also like to copy the GPU submatrix directly back into the original CPU matrix, to keep things efficient. I do not know the exact partitioning in advance.
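For illustration, this is the kind of extra staging copy I would like to avoid (a minimal sketch under my assumptions; copyBlockToDeviceNaive and its parameter names are hypothetical):

    #include <cstddef>
    #include <vector>
    #include <cuda_runtime.h>

    // Pack the block into a temporary contiguous host buffer, then copy that
    // buffer to the device. d_block must point to subRows * subCols floats of
    // device memory. The staging vector is exactly the separate copy I want
    // to get rid of.
    void copyBlockToDeviceNaive(const float* parent, std::size_t parentCols,
                                std::size_t row0, std::size_t col0,
                                std::size_t subRows, std::size_t subCols,
                                float* d_block)
    {
        std::vector<float> staging(subRows * subCols);
        for (std::size_t i = 0; i < subRows; ++i)
            for (std::size_t j = 0; j < subCols; ++j)
                staging[i * subCols + j] = parent[(row0 + i) * parentCols + (col0 + j)];

        cudaMemcpy(d_block, staging.data(),
                   staging.size() * sizeof(float), cudaMemcpyHostToDevice);
    }

Copying back from the device to the original matrix currently needs the same staging buffer in reverse, which is what I mean by wanting a direct copy.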
Does anyone have any ideas on how I can achieve this?
Just to be clear: the matrix needs to be partitioned into blocks, not into rows (which would be relatively easy to do in C/C++).
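For example, with a 4x4 row-major matrix the top-left 2x2 block already shows the issue:

    // 4x4 row-major matrix: element (i, j) lives at 1D index i * 4 + j.
    //
    //   index layout:   0  1 |  2  3
    //                   4  5 |  6  7
    //                  ------+------
    //                   8  9 | 10 11
    //                  12 13 | 14 15
    //
    // The top-left 2x2 block covers indices {0, 1, 4, 5}: two runs of length 2
    // separated by the row pitch of 4, not one contiguous range, whereas a row
    // partition would always be a single contiguous range.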