Optimizing matrix access and multiplication on the CPU

I'm writing an intrinsics-optimized matrix wrapper in Java (using JNI). I'd like confirmation that this approach makes sense; can you give some tips on matrix optimization? What I'm going to implement:

A matrix can be represented as four sets of buffers / arrays, one for horizontal access, one for vertical access, one for diagonal access, and a command buffer for calculating matrix elements only if necessary. Here is an illustration.

    Matrix signature:
        0 1 2 3
        4 5 6 7
        8 9 1 3
        3 5 2 9

    First (horizontal) set:
        horSet[0]={0,1,2,3} horSet[1]={4,5,6,7} horSet[2]={8,9,1,3} horSet[3]={3,5,2,9}

    Second (vertical) set:
        verSet[0]={0,4,8,3} verSet[1]={1,5,9,5} verSet[2]={2,6,1,2} verSet[3]={3,7,3,9}

    Third (optional), a diagonal set:
        diagS={0,5,1,9}   //just in case some calculation needs this

    Fourth (calculation list, in a "one calculation, one data" fashion) set:
        calc={0,2,1,3,2,5}  ---> 0 means multiply by the next element
                                 1 means add the next element
                                 2 means divide by the next element
                                 so this list means ( (a[i]*2)+3 ) / 5 when only a[i] is needed.

    Example for fourth set:
        A.mult(2),   A.sum(3),   A.div(5),   A.mult(B)
        (to list)    (to list)   (to list)   (calculate *+/ just in time when A is needed)
        so only one memory access for four operations.

        loop start
            a[i] = b[i] * ( ( a[i]*2) +3 ) / 5    only for A.mult(B)
        loop end
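The fourth (calculation list) set can be sketched roughly as follows. This is only an illustration under assumptions: the class name `LazyVec`, the `Op` enum, and the `get` accessor are hypothetical, and a flat vector stands in for a matrix buffer. The scalar operations are queued as (opcode, operand) pairs and only applied when the element is actually touched.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical opcodes, numbered as in the question's calc list.
enum Op { MUL = 0, ADD = 1, DIV = 2 };

// Hypothetical lazy vector: scalar operations are queued, not executed.
class LazyVec {
public:
    explicit LazyVec(std::vector<double> values) : data_(std::move(values)) {}

    // Each call only appends an (opcode, operand) pair to the list.
    LazyVec& mult(double k) { calc_.push_back({MUL, k}); return *this; }
    LazyVec& sum(double k)  { calc_.push_back({ADD, k}); return *this; }
    LazyVec& div(double k)  { calc_.push_back({DIV, k}); return *this; }

    // Element-wise product with b: the queued operations are applied while
    // a[i] is already loaded, so each element is read and written once.
    void mult(const std::vector<double>& b) {
        assert(b.size() == data_.size());
        for (std::size_t i = 0; i < data_.size(); ++i) {
            data_[i] = b[i] * applyPending(data_[i]);
        }
        calc_.clear();  // the pending operations are now baked into data_
    }

    // Reading an element applies the pending operations without storing them.
    double get(std::size_t i) const { return applyPending(data_[i]); }

private:
    double applyPending(double x) const {
        for (const auto& step : calc_) {
            switch (step.first) {
                case MUL: x *= step.second; break;
                case ADD: x += step.second; break;
                case DIV: x /= step.second; break;
            }
        }
        return x;
    }

    std::vector<double> data_;
    std::vector<std::pair<Op, double>> calc_;
};
```

With `a.mult(2).sum(3).div(5)` queued, a later `a.mult(b)` makes a single pass over the data instead of four. One trade-off of this design is that every read through `get` re-applies the whole list until something flushes it.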

So, as you can see above, when you need to access the elements of a column, the second set provides contiguous access. No jumps. The first set achieves the same for horizontal access.

This should make some things easier, and some things more complicated:

Easier:

** Matrix transposition. Just swapping the pointers horSet[x] and verSet[x] is enough.
** Matrix * matrix multiplication. One matrix supplies one of its horizontal sets and the other supplies a vertical buffer. The dot product of these should be highly parallelizable with intrinsics/multithreading. If the multiplication order is reversed, the horizontal and vertical sets are switched.
** Matrix * vector multiplication. Same as above, except that a vector can be treated as horizontal or vertical freely.

Harder:

** Doubling the memory requirement is bad in many cases.
** Initializing a matrix takes longer.
** When a matrix is multiplied from the left, it needs a vertical --> horizontal update of its sets if it is going to be multiplied from the right afterwards (and the same in the opposite direction). If a transposition is taken in between, this does not apply.

Neutral:

** The same matrix can be multiplied with two other matrices to get two different results, such as A=A*B (saved in the horizontal sets) and A=C*A (saved in the vertical sets); then A=A*A gives A*B*C*A (in horizontal) and C*A*A*B (in vertical) without copying A.
** If a matrix is always multiplied from the left, or always from the right, no access or multiplication needs an update, and everything stays contiguous in RAM.
** Using only the horizontal sets before transposing and only the vertical sets after should not break any rules.
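A minimal sketch of the two "easier" operations, assuming a hypothetical `DualMatrix` struct that keeps a row-major buffer (`hor`) and a column-major buffer (`ver`) in sync. The names and the flat-buffer layout are assumptions for illustration; the question's own wrapper holds per-row and per-column arrays behind JNI.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical matrix keeping both layouts, as the question proposes.
struct DualMatrix {
    std::size_t rows, cols;
    std::vector<float> hor;  // row-major:    hor[r * cols + c]
    std::vector<float> ver;  // column-major: ver[c * rows + r]

    DualMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c), hor(r * c, 0.0f), ver(r * c, 0.0f) {}

    // Writing one element touches both buffers (the doubled cost noted above).
    void set(std::size_t r, std::size_t c, float v) {
        hor[r * cols + c] = v;
        ver[c * rows + r] = v;
    }
    float get(std::size_t r, std::size_t c) const { return hor[r * cols + c]; }

    // Transposition: the column-major buffer of A is exactly the row-major
    // buffer of A^T, so swapping the buffers (O(1) for std::vector) suffices.
    void transpose() {
        std::swap(hor, ver);
        std::swap(rows, cols);
    }
};

// C = A * B: every output element is a dot product of two contiguous runs.
DualMatrix multiply(const DualMatrix& a, const DualMatrix& b) {
    assert(a.cols == b.rows);
    DualMatrix c(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i) {
        const float* row = a.hor.data() + i * a.cols;      // row i of A
        for (std::size_t j = 0; j < b.cols; ++j) {
            const float* col = b.ver.data() + j * b.rows;  // column j of B
            float acc = 0.0f;
            for (std::size_t k = 0; k < a.cols; ++k) acc += row[k] * col[k];
            c.set(i, j, acc);
        }
    }
    return c;
}
```

In this sketch `multiply` fills both buffers of the result through `set`, which is the eager variant of the vertical/horizontal update listed under "Harder".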

The main purpose is to have a matrix whose dimensions are both multiples of 8 and to apply AVX intrinsics with multiple threads (each thread works on its own set concurrently).

So far I have only implemented the vector * vector dot product. I will go further with this if you can point me in the right direction.

The dot product I wrote (with intrinsics) is 6x faster than the loop-unrolled version (which is itself twice as fast as multiplying one element at a time). It also hits the memory bandwidth cap when multithreading is enabled in the wrapper (8x → it uses almost 20 GB/s, which is close to my DDR3 limit). I have already tried OpenCL; it is slow on the CPU but great on the GPU.
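For reference, a loop-unrolled baseline of the kind mentioned above might look like the sketch below. It is portable C++, not the intrinsics version from the question; the function name `dotUnrolled8` is hypothetical, and the eight partial sums are only meant to mirror the eight float lanes of a 256-bit AVX register.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: dot product of two float arrays whose length n is a
// multiple of 8 (the size constraint stated in the question).
float dotUnrolled8(const float* a, const float* b, std::size_t n) {
    assert(n % 8 == 0);
    // Eight independent partial sums, one per lane of a 256-bit register.
    float acc[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    for (std::size_t i = 0; i < n; i += 8) {
        for (std::size_t lane = 0; lane < 8; ++lane) {
            acc[lane] += a[i + lane] * b[i + lane];
        }
    }
    // Horizontal reduction of the eight partial sums.
    float total = 0.0f;
    for (std::size_t lane = 0; lane < 8; ++lane) total += acc[lane];
    return total;
}
```

Because the partial sums are added in a different order than a plain sequential loop, the result can differ in the last bits for general inputs. On the bandwidth figure: assuming 4-byte floats that are streamed from RAM rather than cache, each multiply-add reads 8 bytes, so a 20 GB/s limit corresponds to roughly 2.5 billion multiply-adds per second in total, however many threads are used.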

Thanks.

Edit: How would a "block matrix" buffer perform? When multiplying large matrices, small patches are multiplied in a special way, and the cache is presumably used to reduce accesses to main memory. But this would require even more updates between matrix multiplications, to keep the vertical, horizontal, and diagonal sets and this block layout in sync.
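As a rough illustration of the block idea, here is a minimal sketch of a tiled multiplication over plain row-major buffers. The function name `multiplyBlocked` and the tile edge `BLOCK = 8` are assumptions; a real tile size would be tuned to the cache, and this is not how any particular library implements it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile edge; a real value would be tuned to the cache size.
const std::size_t BLOCK = 8;

// C += A * B for square n x n row-major matrices, processed tile by tile so
// that the tiles being combined are reused while they are still in cache.
void multiplyBlocked(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += BLOCK) {
        for (std::size_t kk = 0; kk < n; kk += BLOCK) {
            for (std::size_t jj = 0; jj < n; jj += BLOCK) {
                // Clamp the tile so n need not be a multiple of BLOCK.
                const std::size_t iEnd = std::min(ii + BLOCK, n);
                const std::size_t kEnd = std::min(kk + BLOCK, n);
                const std::size_t jEnd = std::min(jj + BLOCK, n);
                for (std::size_t i = ii; i < iEnd; ++i) {
                    for (std::size_t k = kk; k < kEnd; ++k) {
                        const float aik = a[i * n + k];
                        // Innermost loop walks rows of B and C contiguously.
                        for (std::size_t j = jj; j < jEnd; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

With this i-k-j loop order, B and C are both traversed along rows, so the sketch needs no separate vertical copy. Whether that outweighs the dual-buffer layout for a given matrix size is something to measure rather than assume.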

2 answers

Several libraries use expression templates to apply very specific, optimized functions to a cascade of matrix operations.

The C++ Programming Language also has a short chapter on "Fused Operations" (29.5.4, 4th Edition).

This allows chaining operations à la:

 M = A*B.transp(); // where M, A, B are matrices 

In this case, you want to have 3 classes:

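    class Matrix;

    // Proxy that marks a matrix as "to be used transposed"; no data is copied.
    class Transposed {
    public:
        Transposed(const Matrix &matrix) : m_matrix(matrix) {}
        const Matrix & obj (void) const { return m_matrix; }
    private:
        const Matrix & m_matrix;
    };

    // Proxy for the not-yet-evaluated product A * B^T.
    class MatrixMatrixMulTransPosed {
    public:
        MatrixMatrixMulTransPosed(const Matrix &matrix, const Transposed &trans)
            : m_matrix(matrix), m_transposed(trans.obj()) {}
        const Matrix & matrix (void) const { return m_matrix; }
        const Matrix & transposed (void) const { return m_transposed; }
    private:
        const Matrix & m_matrix;
        const Matrix & m_transposed;
    };

    class Matrix {
    public:
        // Only wraps *this; nothing is transposed yet.
        Transposed transp (void) const { return Transposed(*this); }

        // const references so that temporaries such as B.transp() can bind.
        MatrixMatrixMulTransPosed operator* (const Transposed &rhs) const {
            return MatrixMatrixMulTransPosed(*this, rhs);
        }

        Matrix& operator= (const MatrixMatrixMulTransPosed &mmtrans) {
            // Actual computation goes here and is stored in this.
            // using mmtrans.matrix() and mmtrans.transposed()
            return *this;
        }
    };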

You can extend this concept to have a specialized function for every computation that is critical in any way.


This is effectively equivalent to caching the transpose. It sounds like you intend to do this eagerly; I would compute the transpose only when it is needed and remember it in case it is needed again. That way, if you never need it, it is never computed.
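The lazy variant described here could be sketched like this. The class name `CachedMatrix`, the `valid_` flag, and the `builds()` counter are hypothetical and exist only to make the caching visible; thread safety is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical matrix that builds its transposed copy lazily and caches it.
class CachedMatrix {
public:
    CachedMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0f),
          valid_(false), builds_(0) {}

    void set(std::size_t r, std::size_t c, float v) {
        data_[r * cols_ + c] = v;
        valid_ = false;  // any write makes the cached transpose stale
    }
    float get(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }

    // Returns the column-major copy, computing it only on the first request
    // after a write; later requests reuse the cached buffer.
    const std::vector<float>& transposed() {
        if (!valid_) {
            cache_.resize(rows_ * cols_);
            for (std::size_t r = 0; r < rows_; ++r) {
                for (std::size_t c = 0; c < cols_; ++c) {
                    cache_[c * rows_ + r] = data_[r * cols_ + c];
                }
            }
            valid_ = true;
            ++builds_;
        }
        return cache_;
    }

    int builds() const { return builds_; }  // for illustration only

private:
    std::size_t rows_, cols_;
    std::vector<float> data_;   // row-major storage
    std::vector<float> cache_;  // column-major copy, filled on demand
    bool valid_;
    int builds_;
};
```

A matrix that is only ever accessed by rows never pays for the second buffer, while one that alternates between writes and column access rebuilds the copy after each write, which is the case where the eager dual-buffer scheme might win.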


Source: https://habr.com/ru/post/1491980/

