Binary matrix multiplication


Hi. Suppose you have two independent 64-bit binary matrices A and T, where T is already the transpose of the second operand (working with the transposed matrix lets you operate on rows of T rather than on columns, which is very convenient for binary arithmetic during multiplication). You want to multiply these matrices, except that the result of the matrix multiplication is truncated to 64 bits: if the value computed for a particular cell is greater than or equal to 1, the corresponding cell of the result matrix is set to 1, otherwise to 0.

Example

  A          T          <-- this matrix is transposed
  00000001   01111101
  01010100   01100101
  10010111   00010100
  10110000   00011000
  11000100   00111110
  10000011   10101111
  11110101   11000100
  10100000   01100010

Binary and Traditional Multiplication Results:

  Binary     Traditional
  11000100   11000100
  11111111   32212121
  11111111   32213421
  11111111   21112211
  11101111   22101231
  11001111   11001311
  11111111   54213432
  11001111   11001211
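
For reference, a minimal sketch that reproduces both tables above; it assumes the layout used in the listing (one byte per printed row, the leftmost printed bit being the most significant) and that T already holds the rows of the transposed right-hand matrix:

    // Reference sketch only: the byte-per-row layout is an assumption, not part of the question.
    static void MultiplyReference (byte[] a, byte[] t, out int[,] traditional, out byte[] binary)
    {
        traditional = new int[8, 8];
        binary = new byte[8];
        for (int i = 0; i < 8; i++)
            for (int j = 0; j < 8; j++)
            {
                int bits = a[i] & t[j];                // row i of A against row j of T (= column j)
                int count = 0;
                while (bits != 0) { count += bits & 1; bits >>= 1; }
                traditional[i, j] = count;             // "Traditional" cell value
                if (count > 0)
                    binary[i] |= (byte)(1 << (7 - j)); // "Binary" cell is 1 when the sum is nonzero
            }
    }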

Question

How do you multiply these matrices as described above in the most efficient way?

PS

I tried to use bitwise AND (i.e., the & operator) instead of doing the multiplication bit by bit, in which case I had to prepare the data for the multiplication first:

    ulong u;
    u = T & 0xFF;
    u = (u << 00) + (u << 08) + (u << 16) + (u << 24) +
        (u << 32) + (u << 40) + (u << 48) + (u << 56);
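
As a side note, those eight shifted additions simply replicate the low byte of u into every byte, so the same preparation can also be written as a single multiplication (a sketch, equivalent in unsigned 64-bit arithmetic):

    ulong u = T & 0xFF;
    u *= 0x0101010101010101UL;  // replicate the low byte into all eight bytes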

Now, executing a bitwise AND over the two integers A and u leads to the following:

  A          u          R          C
  00000001   01111101   00000001   1
  01010100   01111101   01010100   3
  10010111   01111101   00010101   3
  10110000   01111101   00110000   2
  11000100   01111101   01000100   2
  10000011   01111101   00000001   1
  11110101   01111101   01110101   5
  10100000   01111101   00100000   1

In the example above, R contains the result of ANDing the bits of A with u, and to get the final value we must sum all the bits in each byte. Note that column C contains exactly the values found in the first column of the Traditional multiplication result above. The problem is that in this step I have to work with individual bits, which in my opinion is a suboptimal approach. I read http://graphics.stanford.edu/~seander/bithacks.html looking for a way to do this in parallel, but no luck. If anyone knows how to "flatten" and "combine" the values located in column R so that the result is a 64-bit matrix, I would appreciate it if you could drop me a few lines.
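
For what it's worth, the reduction itself needs no bit counting: the sum of the bits in a byte is greater than zero exactly when the byte is nonzero. A per-byte sketch of that "flatten" step follows (the accepted answer below performs the same collapse branchlessly on all eight bytes at once; which result bit row i maps to is just one possible packing convention):

    ulong R = A & u;
    byte column = 0;                       // bit i holds the 0/1 result for row i against this column
    for (int i = 0; i < 8; i++)
    {
        if (((R >> (8 * i)) & 0xFF) != 0)  // is the i-th byte of R nonzero?
            column |= (byte)(1 << i);
    }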

Thanks,

Edit

With great thanks to David Eisenstat, the final algorithm looks like this:

    var A = ...;
    var T = ...; // T == transpose(t), t is the original matrix; the algorithm works with the transposed matrix
    var D = 0x8040201008040201UL;
    ulong U, r = 0;

    U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
    U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);
    T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);

    U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
    U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);
    T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);

    U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
    U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);
    T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);

    U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
    U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);
    T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);

    U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
    U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);
    T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);

    U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
    U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);
    T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);

    U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
    U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);
    T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);

    U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
    U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);
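
For readability, the eight unrolled steps can also be wrapped in a loop; a behaviourally equivalent sketch (the unrolled form above merely avoids the loop overhead):

    static ulong MultiplyTransposed (ulong A, ulong T)
    {
        ulong r = 0;
        ulong d = 0x8040201008040201UL;            // main diagonal mask
        for (int i = 0; i < 8; i++)
        {
            ulong u = A & T;
            u |= u >> 1; u |= u >> 2; u |= u >> 4; // is each byte nonzero?
            u &= 0x0101010101010101UL;
            u = (u << 8) - u;                      // spread each 0/1 to a full byte
            r |= u & d;                            // keep the current diagonal of the result
            T = (T << 8) | (T >> 56);              // rotate to the next diagonal
            d = (d << 8) | (d >> 56);
        }
        return r;
    }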

The following code snippet:

    public static void Main (string[] args)
    {
        ulong U;
        var Random = new Xor128 ();
        var timer = DateTime.Now;
        var A = Random.As<IUniformRandom<UInt64>>().Evaluate();
        var T = Random.As<IUniformRandom<UInt64>>().Evaluate();
        var steps = 10000000;

        for (var i = 0; i < steps; i++)
        {
            ulong r = 0;
            var d = 0x8040201008040201UL;

            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
            U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
            T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
            U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
            T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
            U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
            T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
            U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
            T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
            U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
            T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
            U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
            T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
            U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
            T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4;
            U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
        }
        Console.WriteLine (DateTime.Now - timer);

        var m1 = new Int32[8,8];
        var m2 = new Int32[8,8];
        var m3 = new Int32[8,8];
        for (int row = 0; row < 8; row++)
        {
            for (int col = 0; col < 8; col++)
            {
                m1 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
                m2 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
                m3 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
            }
        }

        timer = DateTime.Now;
        for (int i = 0; i < steps; i++)
        {
            for (int row = 0; row < 8; row++)
            {
                for (int col = 0; col < 8; col++)
                {
                    var sum = 0;
                    for (int temp = 0; temp < 8; temp++)
                    {
                        sum += m1 [row, temp] * m2 [temp, col];
                    }
                    m3 [row, col] = sum;
                }
            }
        }
        Console.WriteLine (DateTime.Now - timer);
    }

It prints the following timings:

    00:00:02.4035870
    00:00:57.5147150

That's about a 23x performance improvement on Mac OS X / Mono. Thanks, everyone!

+7
5 answers

I'm not sure about the most effective way, but here is something to try. The following sequence of instructions computes the main diagonal of the product A * T'. Rotate both T and D by 8 bits and repeat for the other 7 iterations.

    // uint64_t A, T;
    uint64_t D = UINT64_C(0x8040201008040201);
    uint64_t P = A & T;
    // test whether each byte is nonzero
    P |= P >> 1;
    P |= P >> 2;
    P |= P >> 4;
    P &= UINT64_C(0x0101010101010101);
    // fill each nonzero byte with ones
    P *= 255;   // or P = (P << 8) - P;
    // leave only the current diagonal
    P &= D;
+5

If you are looking for a way to parallelize the matrix multiplication, split the result matrix into blocks and compute each block in parallel:

http://en.wikipedia.org/wiki/Block_matrix#Block_matrix_multiplication
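
As a sketch only (assuming larger matrices stored as an n-by-n grid of the 8x8 ulong blocks from the question, with the right-hand operand kept block-transposed, and `block` standing for the 64-bit routine from the question), the boolean block product could look roughly like this:

    static ulong[,] BlockMultiply (ulong[,] A, ulong[,] Bt, System.Func<ulong, ulong, ulong> block)
    {
        int n = A.GetLength(0);
        var C = new ulong[n, n];
        System.Threading.Tasks.Parallel.For(0, n, i =>
        {
            for (int j = 0; j < n; j++)
            {
                ulong acc = 0;
                for (int k = 0; k < n; k++)
                    acc |= block(A[i, k], Bt[j, k]);  // boolean sum (OR) of the block products
                C[i, j] = acc;
            }
        });
        return C;
    }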

+2

It is not clear which data structure you are using, which language (yes, I know you said "any language"), and what you are trying to optimize (speed? memory?), etc. All of these can have a profound effect on your solution.

Some examples:

  • Say this is C/C++ and your matrices are stored as bits in memory, with each row/column packed into a uint8_t. In that case, multiplying a row by a column reduces to an 8-bit bitwise AND followed by a check that the result is nonzero (there is no need to sum the bits); that takes two processor instructions (see the sketch after this list).
  • If you are forced to perform bit-by-bit operations, use bitwise OR ( | ) instead of +. Some languages may even evaluate this lazily, stopping at the first "1" they encounter.
  • If you can use multiple threads, you can speed up the computation.
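
A minimal sketch of the first bullet, with the two example rows taken from the question:

    byte row = 0b01010100;               // second row of A
    byte col = 0b01111101;               // first column, i.e. the first row of T
    int cell = (row & col) != 0 ? 1 : 0; // one AND plus a zero test; no bit counting needed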

By the way, I assume that you have many matrices to process, otherwise I would just use straightforward, readable code. And I suspect that even with a large number of matrices the performance gain may turn out to be negligible.

+2

If you are willing to drop below plain C/C++, the SSE/AVX machine instructions together with compiler intrinsics let you write much faster code (about 4x according to a test I did). You need to use a non-standard vector type (supported at least by GCC, ICC, Clang):

 using epu = uint8_t __attribute__((vector_size(16))); 

I use a class like

    class BMat8 {
        [...]
    private:
        uint64_t _data;
    };

then the following code should do what you want

    static constexpr epu rothigh { 0, 1, 2, 3, 4, 5, 6, 7,15, 8, 9,10,11,12,13,14};
    static constexpr epu rot2    { 6, 7, 0, 1, 2, 3, 4, 5,14,15, 8, 9,10,11,12,13};

    inline BMat8 operator*(BMat8 const& tr) const {
        epu x = _mm_set_epi64x(_data, _data);
        epu y = _mm_shuffle_epi8(_mm_set_epi64x(tr._data, tr._data), rothigh);
        epu data {};
        epu diag = {0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80,
                    0x80,0x01,0x02,0x04,0x08,0x10,0x20,0x40};
        for (int i = 0; i < 4; ++i) {
            data |= ((x & y) != epu {}) & diag;
            y    = _mm_shuffle_epi8(y, rot2);
            diag = _mm_shuffle_epi8(diag, rot2);
        }
        return BMat8(_mm_extract_epi64(data, 0) | _mm_extract_epi64(data, 1));
    }

In particular, using a 128-bit register, I can do two iterations at the same time.

+1

A strictly Boolean-algebra solution can be implemented quite efficiently on x86-64 using the approach I described here:

fooobar.com/questions/682622 / ...

The only difference is that the data of the transposed matrix must also be extracted by columns and repacked into rows before each 64-bit product. Fortunately, this is trivial with the BMI2 instruction for parallel bit extraction, available in GCC through the _pext_u64 intrinsic:

    #include <stdint.h>
    #include <immintrin.h>   // _pext_u64 (requires BMI2)

    uint64_t torow (uint64_t c);

    uint64_t mul8x8T (uint64_t A, uint64_t B) {
        const uint64_t COL = 0x0101010101010101;
        uint64_t C = 0;
        for (int i = 0; i < 8; ++i) {
            uint64_t p = COL & (A >> i);          // select column
            uint64_t r = torow( COL & (B >> i) );
            C |= (p * r);                         // use ^ for GF(2) instead
        }
        return C;
    }

    uint64_t torow (uint64_t c) {
        const uint64_t ROW = 0x00000000000000FF;  // mask of the first row
        const uint64_t COL = 0x0101010101010101;  // mask of the first column
        // select bits of c in positions marked by COL,
        // and pack them consecutively;
        // the last 'and' is included for clarity and is not
        // really necessary
        return _pext_u64(c, COL) & ROW;
    }

On processors that do not support this particular instruction, one possible fallback is to adapt a typical bit-packing trick, used for example in the classic reversal of bit order via a 64-bit multiplication:

https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith64BitsDiv

Masking and integer multiplication by suitable constants yield a 64-bit word that contains the packed result as a contiguous bit substring, which can then be extracted with a bit shift and a mask.

The idea is to think of the multiplication step as a set of parallel bit shifts, where each input bit is shifted by a different amount given by the constant. This works as long as no two shifted bits land on the same position in the result, that is, as long as each partial product of the multiplication updates a different set of bit positions. That rules out any carries, making the bitwise sum equivalent to a parallel OR (or XOR).

    uint64_t torow (uint64_t c) {
        const uint64_t ROW = 0x00000000000000FF;  // select the 8 lowest consecutive bits to get the first row
        const uint64_t COL = 0x0101010101010101;  // select every 8th bit to get the first column
        const uint64_t DIA = 0x8040201008040201;  // select every 9th bit to obtain a diagonal

        c *= ROW;        // "copies" the first column into the remaining columns
        c &= DIA;        // only use diagonal bits, or else there will be position collisions and unexpected carries
        c *= COL;        // "scatters" every bit to all rows after itself; the last row now contains the packed bits
        return c >> 56;  // move the last row to the first and discard the rest
    }
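
For completeness, the same multiplication-based trick transcribed to C# (a sketch; the unsigned 64-bit arithmetic behaves identically, and the overflowing partial products simply fall off the top), together with a small worked example:

    static ulong ToRow (ulong c)
    {
        const ulong ROW = 0x00000000000000FF;  // first row
        const ulong COL = 0x0101010101010101;  // first column
        const ulong DIA = 0x8040201008040201;  // main diagonal
        c *= ROW;        // fill byte k with copies of column bit k (the shifted copies cannot collide)
        c &= DIA;        // keep one copy per byte, each at its own offset 9*k
        c *= COL;        // gather all eight bits into the top byte
        return c >> 56;  // the packed row
    }
    // Example: column bits set in rows 0, 1 and 5, i.e. c == 0x0000010000000101UL,
    // pack to 0b00100011 == 0x23.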

There are other possible implementations of this function that use more, but cheaper, operations; which one is fastest will depend on the target architecture.

0

Source: https://habr.com/ru/post/952517/

