How does shingleprinting work in practice?

I am trying to use shingleprinting to measure document similarity. The process includes the following steps:

  1. Create the 5-shingles of two documents D1 and D2.
  2. Hash each shingle with a 64-bit hash.
  3. Pick a random permutation of the numbers from 0 to 2^64-1 and apply it to the shingle hashes.
  4. For each document, find the smallest of the resulting values.
  5. If they match, count it as a positive example; otherwise count it as a negative example.
  6. Repeat steps 3 to 5 several times.
  7. Use positive_examples / total_examples as the measure of similarity.
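
For concreteness, here is a minimal sketch of the first two steps, plus the exact Jaccard similarity that the sampling steps are meant to estimate. Word-level shingles, the truncated SHA-1 hash, and the helper names (shingles, hash64, hashed_shingles, jaccard) are assumptions made for illustration, not part of any particular shingleprinting implementation.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-word shingles of text (word-level shingling is an assumption)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hash64(shingle):
    """Map a shingle to a 64-bit integer (here: the first 8 bytes of its SHA-1 digest)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hashed_shingles(text, k=5):
    """Shingle a document and hash each shingle to 64 bits."""
    return {hash64(s) for s in shingles(text, k)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumps over the lazy cat"
print(jaccard(hashed_shingles(d1), hashed_shingles(d2)))
```

With a truly random permutation, the probability that the two minima match equals this Jaccard similarity, which is why the fraction of matches can serve as an estimate of it.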

Step 3 involves generating a random permutation of a very long sequence. Using a Knuth shuffle is out of the question. Is there a shortcut for this? Note that in the end we need only one element of the resulting permutation.
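
To put a number on why an explicit shuffle is infeasible, here is a rough back-of-the-envelope calculation. It assumes the permutation would be stored as a table with one 8-byte entry per 64-bit value, which is an assumption made only for this estimate.

```python
# Memory needed to materialize a permutation of all 64-bit values,
# assuming one 8-byte entry per value (an assumption for this estimate).
entries = 2 ** 64
bytes_needed = entries * 8          # 2**67 bytes
exbibytes = bytes_needed / 2 ** 60  # 1 EiB = 2**60 bytes
print(exbibytes)  # 128.0
```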

1 answer

Warning: I am not 100% sure, but I have read some of the papers, and I believe this is how it works. For example, in “A Small Approximately Min-Wise Independent Family of Hash Functions,” Piotr Indyk writes: “In an implementation integrated with Altavista, the set H is chosen as a pairwise independent family of hash functions.”

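In step 3, you do not need a truly random permutation of [n] (the integers from 1 to n). A random hash function on [n] drawn from a pairwise independent family is believed to work well enough in practice. So you pick a random pairwise independent hash function h, apply h to each shingle hash, and take the min in step 4.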

A standard family of pairwise independent hash functions is h(x) = ax + b (mod p), where a and b are chosen at random and p is a prime.
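
A minimal sketch of how this could look, assuming the shingle hashes are already available as sets of 64-bit integers. The prime P = 2^89 - 1, the helper names (make_hash, minhash_similarity), and the default of 200 trials are choices made for illustration. Note also that a is drawn nonzero here so that h maps distinct inputs to distinct values, which deviates slightly from drawing a uniformly from all residues.

```python
import random

# A Mersenne prime larger than 2**64, so every 64-bit hash is a valid residue.
# This particular prime is an assumption; any prime above 2**64 would do.
P = (1 << 89) - 1


def make_hash(rng):
    """Draw one function h(x) = (a*x + b) mod P from the linear family."""
    a = rng.randrange(1, P)  # nonzero, so h is a bijection on residues mod P
    b = rng.randrange(0, P)
    return lambda x: (a * x + b) % P


def minhash_similarity(hashes1, hashes2, trials=200, seed=0):
    """Estimate similarity as the fraction of trials in which the minima match.

    Both inputs are assumed to be non-empty sets of 64-bit integers.
    """
    rng = random.Random(seed)
    matches = 0
    for _ in range(trials):
        h = make_hash(rng)                 # stands in for the random permutation (step 3)
        m1 = min(h(x) for x in hashes1)    # step 4 for the first document
        m2 = min(h(x) for x in hashes2)    # step 4 for the second document
        if m1 == m2:                       # step 5
            matches += 1
    return matches / trials                # step 7


a_set = set(range(0, 100))
b_set = set(range(50, 150))
print(minhash_similarity(a_set, b_set))
```

Each trial costs one pass over the shingle hashes and needs only the two numbers a and b, so nothing of size 2^64 is ever stored. Because the family is only pairwise independent rather than min-wise independent, the result should be read as an approximation of the Jaccard similarity.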

References:
http://www.cs.princeton.edu/courses/archive/fall08/cos521/hash.pdf
http://people.csail.mit.edu/indyk/minwise99.ps


Source: https://habr.com/ru/post/1753857/

