Similarity Indexing

I have 100M number vectors ( Minhash fingerprints), each vector contains 100 integers from 0 to 65536, and I'm trying to do a quick search for similarities to this fingerprint database using the similarities to Jaccard , i.e. a given query vector (e.g. [1,0,30, 9, 42, ...]) find the intersection / union ratio of this query to the database of 100M-sets.

The requirement is to return k "nearest neighbors" of the request vector to <1 sec (not including indexing / time of the IO file) on the laptop. Therefore, it is obvious that some kind of indexing is required, and the question is what will be the most effective way to approach this.

Notes: I thought of using SimHash , but in this case you really need to know the size of the intersection of the sets in order to identify, not the net affinity / similarity, but Simhash will lose this information.

I tried using a simple location-sensitive hashing method as described in ch3 by Jeffrey Ullman , dividing each vector into 20 β€œstrips” or fragments of length 5, converting these fragments into strings (for example, [1, 2, 45, 2, 3] β†’ "124523") and using these lines as keys in a hash table, where each key contains "neighbor candidates." But the problem is that he creates too many candidates for some of these fragments, and changing the number of groups does not help.

+4
source share
2 answers

One way to do the following:

(1) Arrange the vectors in the tree (base tree).

(2) Request the tree using fuzzy criteria, in other words, the coincidence is that the difference in values ​​in each node of the tree is within a threshold

(3) From (2) we generate a subtree containing all the matching vectors

(4) Now repeat process (2) on the sub tree with a lower threshold

Continue until there are no K elements in the subtree. If K has too few elements, then take the previous tree and compute the distance on each member of the subtree and sort to eliminate the worst matches until you leave only the K elements.

+2
source

I might be a little late, but I would suggest IVFADC indexing from Jegou et al .: Product quantization for the closest neighboring search

It works for L2 Distance / dot similarity measurements and is a bit more complicated, but particularly effective in terms of time and memory.

It is also implemented in the FAISS library for similarities , so you can also look at that.

+2
source

Source: https://habr.com/ru/post/1483660/


All Articles