I have 100M MinHash fingerprints; each is a vector of 100 integers in the range 0 to 65536. I want to do a fast similarity search against this fingerprint database using estimated Jaccard similarity, i.e. given a query vector (e.g. [1, 0, 30, 9, 42, ...]), find the sets among the 100M whose intersection/union ratio with the query is highest.
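To make the goal concrete, here is a minimal brute-force sketch (no index yet) on toy data. It relies on the standard MinHash property that the fraction of positions where two signatures agree is an unbiased estimate of the Jaccard similarity of the underlying sets; the array names and sizes are illustrative, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real data: MinHash signatures of length 100
# with integer components in [0, 65536).
db = rng.integers(0, 65536, size=(1000, 100))   # "database" of signatures
query = db[42].copy()                           # a query signature

def estimated_jaccard(query, signatures):
    """Jaccard estimate = fraction of signature positions that agree."""
    return (signatures == query).mean(axis=1)

sims = estimated_jaccard(query, db)
top_k = np.argsort(sims)[::-1][:5]   # indices of the 5 most similar rows
```

At 100M rows this linear scan is exactly what blows the 1-second budget, which is why an index is needed; but it is the ground truth that any LSH candidate set should be checked against.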
The requirement is to return the k "nearest neighbors" of the query vector in <1 sec (not counting indexing / file I/O time) on a laptop. So some kind of indexing is clearly required, and the question is what the most effective approach would be.
Notes: I thought of using SimHash, but in this case I actually need to know the size of the set intersection, not just an overall affinity/similarity score, and SimHash loses that information.
I tried the simple locality-sensitive hashing scheme described in chapter 3 of Jeffrey Ullman's book: divide each vector into 20 "bands" (fragments) of length 5, convert each band into a string (e.g. [1, 2, 45, 2, 3] → "124523"), and use these strings as keys in a hash table, where each key maps to a list of "candidate neighbors". But the problem is that this creates too many candidates for some of the fragments, and changing the number of bands does not help.
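For reference, here is a sketch of that banding scheme on toy data. Sizes and names are made up for illustration; one deliberate change from the description above is using tuples rather than concatenated strings as keys, since string concatenation is ambiguous ([1, 2, 45] and [12, 4, 5] both become "1245"), which by itself can inflate some buckets:

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(1)
db = rng.integers(0, 65536, size=(1000, 100))  # toy signature database
query = db[7].copy()

BANDS, ROWS = 20, 5   # 20 bands of 5 rows each (20 * 5 = 100)

# Build the index: one hash table per band, tuple keys to avoid
# the string-concatenation ambiguity.
tables = [defaultdict(list) for _ in range(BANDS)]
for idx, sig in enumerate(db):
    for b in range(BANDS):
        key = tuple(sig[b * ROWS:(b + 1) * ROWS])
        tables[b][key].append(idx)

# Query: any row sharing at least one full band with the query
# becomes a candidate; rank candidates by estimated Jaccard.
candidates = set()
for b in range(BANDS):
    key = tuple(query[b * ROWS:(b + 1) * ROWS])
    candidates.update(tables[b].get(key, []))

sims = {c: float((db[c] == query).mean()) for c in candidates}
nearest = sorted(sims, key=sims.get, reverse=True)
```

With b = 20 bands of r = 5 rows, the probability that a pair with Jaccard similarity s becomes a candidate is 1 - (1 - s^5)^20, so the "too many candidates" symptom usually means either many genuinely near-duplicate signatures in the data or skewed value distributions making some band keys very common.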
— alex