Near-duplicate detection in data streams

I am currently working with a streaming API that generates a lot of text content. As expected, the API returns many duplicates, and we also have a business requirement to filter out near-duplicate data.

I did a little research on duplicate detection in data streams and read about Stable Bloom filters. Stable Bloom filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate.
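For context, here is a minimal sketch of how I understand a Stable Bloom filter to work: each arriving item is checked against a few hashed cells, then a handful of random cells are decremented so that stale entries fade out, and finally the item's own cells are set to the maximum value. The class name, parameter names, default sizes, and the BLAKE2-based hashing are my own assumptions for illustration, not values from any paper or library.

```python
import hashlib
import random


class StableBloomFilter:
    """Illustrative Stable Bloom filter; all sizes are arbitrary assumptions."""

    def __init__(self, num_cells=10_000, num_hashes=3, max_value=3,
                 decrements=10, seed=0):
        self.cells = bytearray(num_cells)   # small counters, all start at 0
        self.num_hashes = num_hashes        # cells probed per item
        self.max_value = max_value          # value a cell is set to on insert
        self.decrements = decrements        # random cells aged per arrival
        self.rng = random.Random(seed)

    def _positions(self, item):
        # Derive several cell positions by prefixing the item with an index.
        positions = []
        for i in range(self.num_hashes):
            data = f"{i}:{item}".encode("utf-8")
            digest = hashlib.blake2b(data, digest_size=8).digest()
            positions.append(int.from_bytes(digest, "big") % len(self.cells))
        return positions

    def seen_before(self, item):
        """Report whether item looks like a duplicate, then record it."""
        positions = self._positions(item)
        duplicate = all(self.cells[p] > 0 for p in positions)
        # Age the filter: decrement a few randomly chosen non-zero cells.
        for _ in range(self.decrements):
            p = self.rng.randrange(len(self.cells))
            if self.cells[p] > 0:
                self.cells[p] -= 1
        # Record the current item by resetting its cells to the maximum.
        for p in positions:
            self.cells[p] = self.max_value
        return duplicate


sbf = StableBloomFilter()
first = sbf.seen_before("some text record")
second = sbf.seen_before("some text record")
```

Note that this only catches exact repeats of the same string, which is why it does not cover the near-duplicate requirement on its own.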

But I also want to identify near duplicates, so I looked at hashing techniques such as LSH (locality-sensitive hashing) and MinHash, which are used for nearest-neighbor search and near-duplicate detection.
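To make the MinHash idea concrete, here is a small sketch under my own assumptions: texts are split into word 3-grams (shingles), each of 64 seeded hash functions keeps the minimum hash over the shingle set, and the fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets. The function names, the shingle size, and the number of hashes are illustrative choices, not taken from a specific library.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of word k-grams of text (the whole text if it is shorter)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, value):
    # 64-bit hash of value; the seed prefix simulates a family of hash functions.
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text)
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jumps over the lazy cat")
c = minhash_signature("streaming apis often emit many repeated text records")

near = estimated_jaccard(a, b)   # shares most shingles with the first text
far = estimated_jaccard(a, c)    # shares no shingles with the first text
```

In a real stream the signatures would presumably be split into bands and indexed with LSH so that each new item is compared only against likely candidates rather than against everything seen so far.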

I am somewhat stuck and looking for pointers on how to proceed, as well as papers or implementations I could look at.

+5
2 answers
  • ( ) , - , , ; . MD5 ( - ) . MD5 ( 64- ) , , , , . , , .

  • , ( ), . SpotSigs . , Sigs() , x, Sigs(x) (1-5) 64- . - SpotSigs , , - . simhash ( ).

  • Sigs(), . SpotSigs , , , simhash.
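Since the answers mention simhash and 64-bit signatures, here is a minimal sketch of a generic simhash fingerprint, written as an assumption about the usual construction rather than as the answerer's exact method. Each token is hashed to 64 bits, every bit position votes +1 or -1, and the sign of each total gives one bit of the fingerprint; near-duplicate texts are then expected to differ in only a few bits. The tokenization (lowercased whitespace split) and the function names are illustrative.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8-byte digest used below


def simhash(text):
    """Return a 64-bit simhash fingerprint of text using word tokens."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 for bits set in its hash and -1 for bits not set.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp1 = simhash("the quick brown fox jumps over the lazy dog")
fp2 = simhash("the quick brown fox jumps over the lazy cat")
distance = hamming_distance(fp1, fp2)
```

A threshold on the Hamming distance (a few bits out of 64 is a common choice) would then decide whether two items count as near duplicates; the right value depends on the data.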

+6

Source: https://habr.com/ru/post/1616140/

