Fast scalar product of ternary vectors

Consider two vectors A and B of size n, with 7 <= n <= 23. Both A and B consist only of -1s, 0s and 1s.

I need a fast algorithm that calculates the scalar product of A and B.

So far, I have been thinking of storing the signs and values in two separate uint32_t masks, using the following encoding:

  • sign 0, value 0 → 0
  • sign 0, value 1 → 1
  • sign 1, value 1 → -1
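
For illustration, here is a minimal sketch of how an array of -1/0/1 values might be packed into this two-mask representation (the pack_ternary name and the int8_t input type are just assumptions for the example, not part of my original code):

    #include <cstdint>

    // Packs n ternary values (-1, 0, +1) into the sign/value bit masks.
    // Bit i of value is set when v[i] is non-zero; bit i of sign is set
    // when v[i] is -1 (so sign is only meaningful where value is set).
    void pack_ternary(const int8_t* v, int n, uint32_t& sign, uint32_t& value) {
        sign = 0;
        value = 0;
        for (int i = 0; i < n; ++i) {
            if (v[i] != 0) {
                value |= 1u << i;
                if (v[i] < 0) sign |= 1u << i;
            }
        }
    }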

The C++ implementation I was thinking about looks like this:

    struct ternary_vector { uint32_t sign, value; };

    int inner_product(const ternary_vector & a, const ternary_vector & b) {
        uint32_t psign  = a.sign ^ b.sign;   // sign of each elementwise product (meaningful only where both are non-zero)
        uint32_t pvalue = a.value & b.value; // 1 where the product is non-zero
        psign &= pvalue;                     // keep sign bits only at non-zero positions
        pvalue ^= psign;                     // pvalue now marks +1 products, psign marks -1 products
        return __builtin_popcount(pvalue) - __builtin_popcount(psign);
    }

This works pretty well, but I'm not sure if it can be done better. Any comments on this are greatly appreciated.

+6
4 answers

I like having the two uint32_t masks, but I think your actual calculation is a bit wasteful.

Just a few small points:

  • I'm not sure about the references (taking a and b by const &): this adds a level of indirection compared to passing them on the stack. When the code is this small (perhaps only a few instructions), that overhead matters. Try passing by value and see what you get.

  • __builtin_popcount can unfortunately be very inefficient. I have used it myself, and found that even a very simple implementation I wrote by hand was much faster. It depends on the platform, however.

Basically, if the platform has a hardware popcount instruction, __builtin_popcount uses it. If not, it falls back to a much less efficient software replacement.
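
For reference, here is one of the classic software popcount variants that could be benchmarked against __builtin_popcount on a given target (a generic illustration, not the specific implementation this answer refers to):

    #include <cstdint>

    // Portable SWAR popcount for a 32-bit word.
    // On x86, compiling with -mpopcnt (or -march=native) usually makes
    // __builtin_popcount emit a single POPCNT instruction instead of a
    // software fallback, so measure before switching.
    inline int popcount32(uint32_t x) {
        x = x - ((x >> 1) & 0x55555555u);                  // sum bits in pairs
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  // sum pairs into nibbles
        x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  // sum nibbles into bytes
        return (x * 0x01010101u) >> 24;                    // add the four byte counts
    }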

+3

One major issue here is the reuse of psign and pvalue to hold the masks of the positive and negative products. Reusing variables like that buys neither your compiler nor yourself anything, and it obscures the code.
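
For example, the same computation with separately named masks for the +1 and -1 positions (my own sketch of what this suggests, assuming the ternary_vector struct from the question):

    int inner_product(const ternary_vector & a, const ternary_vector & b) {
        uint32_t nonzero = a.value & b.value;        // product is non-zero here
        uint32_t neg = (a.sign ^ b.sign) & nonzero;  // product is -1 here
        uint32_t pos = nonzero & ~neg;               // product is +1 here
        return __builtin_popcount(pos) - __builtin_popcount(neg);
    }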

0

Could you encode your ternary state in a std::bitset<2> and define the product in terms of & ? For example, if your ternary values are:

     1 = P = (1, 1)
     0 = Z = (0, 0)
    -1 = M = (1, 0) or (0, 1)

I believe that you can define your product as:

     1 *  1 =  1  =>  P * P = P  =>  (1, 1) & (1, 1) = (1, 1) = P
     1 *  0 =  0  =>  P * Z = Z  =>  (1, 1) & (0, 0) = (0, 0) = Z
     1 * -1 = -1  =>  P * M = M  =>  (1, 1) & (1, 0) = (1, 0) = M

Then the inner product could start by taking the & of the bits of the corresponding elements and ... I am still working out how to add them up.

Edit:

My proposal above did not account for (-1) * (-1) = 1, which cannot be handled by the encoding I proposed. Thanks to @user92382 for pointing this out.

0

Depending on your architecture, you may be able to optimize away the temporary bit vectors: for example, if your code is compiled to an FPGA or laid out in an ASIC, a straight sequence of logical operations will be better in terms of speed/energy/area than storing and then reading back two large intermediate buffers.

In this case, you can do:

    int inner_product(const ternary_vector & a, const ternary_vector & b) {
        return __builtin_popcount(a.value & b.value & ~(a.sign ^ b.sign))
             - __builtin_popcount(a.value & b.value &  (a.sign ^ b.sign));
    }

This would map very well to hardware: (a.value & b.value & ...) can enable or disable the XOR gate, whose output is split into two signed accumulators, the first path being NOTed before accumulation.
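
In software, any of these bit-twiddling variants can be checked against a naive reference implementation over small arrays (this helper is only for testing and is not part of the answer):

    #include <cstdint>

    // Naive reference: plain element-by-element dot product of -1/0/1 values.
    int inner_product_ref(const int8_t* a, const int8_t* b, int n) {
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;
    }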

0

Source: https://habr.com/ru/post/957209/

