Near-duplicate detection in data streams

Question


I am currently working with a streaming API that generates a lot of textual content. As expected, the API returns many duplicates, and we also have a business requirement to filter out near-duplicate data.

I did a bit of research on duplicate detection in data streams and read about Stable Bloom Filters. Stable Bloom Filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate.
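For readers unfamiliar with the structure, here is a minimal sketch of how a Stable Bloom Filter can flag exact duplicates in a stream. It is only an illustration: the class name, the method name `seen_then_add`, and all parameter values (`num_cells`, `num_hashes`, `max_value`, `decrements`) are assumptions chosen for this example, and the salted SHA-1 hashing is just one convenient way to get several hash functions.

```python
import hashlib
import random


class StableBloomFilter:
    """Sketch of a Stable Bloom Filter; all defaults are illustrative."""

    def __init__(self, num_cells=10_000, num_hashes=3, max_value=3,
                 decrements=10, seed=0):
        self.cells = [0] * num_cells      # small counters, all start at 0
        self.num_hashes = num_hashes      # cells touched per item
        self.max_value = max_value        # value a cell is set to on insert
        self.decrements = decrements      # cells aged on every insert
        self.rng = random.Random(seed)

    def _indexes(self, item):
        # Derive several cell indexes by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.cells)

    def seen_then_add(self, item):
        """Return True if item looks like a duplicate, then record it."""
        idx = list(self._indexes(item))
        duplicate = all(self.cells[i] > 0 for i in idx)
        # Age the filter: decrement randomly chosen cells so old items fade.
        for _ in range(self.decrements):
            j = self.rng.randrange(len(self.cells))
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Refresh the cells that belong to the current item.
        for i in idx:
            self.cells[i] = self.max_value
        return duplicate
```

Because cells are continuously decremented, old items are eventually forgotten, which keeps memory constant on an unbounded stream at the cost of some false negatives. Note that this only catches exact repeats of the same string, not near duplicates.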

But I want to identify near duplicates, so I also looked at hashing algorithms such as LSH and MinHash, which are used in nearest-neighbour and near-duplicate detection problems.
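To make the MinHash idea concrete, here is a small sketch that estimates the Jaccard similarity of two texts from their signatures. It is an illustration under assumptions, not a reference implementation: the helper names (`shingles`, `minhash_signature`, `estimated_jaccard`), the 3-word shingle size, and the 64 salted SHA-1 hash functions are all choices made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of text (lowercased, split on whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, value):
    # One member of a family of hash functions, selected by the seed.
    digest = hashlib.sha1(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=64):
    """For each hash function, keep the minimum hash value over all items."""
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

A usage example: compute `minhash_signature(shingles(text))` for each incoming item and compare it with recent signatures using `estimated_jaccard`; items above a chosen threshold are treated as near duplicates. LSH would then be the step that avoids comparing against every stored signature, for instance by bucketing on bands of the signature.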

I am kind of stuck and looking for pointers on how to proceed, and for papers or implementations that I could look at.

+5

duplicates filtering streaming bloom-filter

thickblood Apr 27 '12 at 10:24


2 answers

http://micvog.com/2013/09/08/storm-first-story-detection/

+1

Ashwin Jayaprakash 18 . '13 5:25

Jeff Kubina · Accepted Answer · 2012-05-01T15:43:59+0000

[Most of this answer's text was lost in extraction. The surviving fragments refer to: an MD5 hash, handled as 64-bit values; SpotSigs; a routine Sigs() such that, for a string x, Sigs(x) is a small set (1-5) of 64-bit values; and simhash.]
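The answer above names simhash. As a rough illustration only, and not the answerer's own code, a minimal simhash sketch could look like the following. The function names, the whitespace tokenisation, the 64-bit width, and the use of BLAKE2b as the per-token hash are all assumptions made for this example.

```python
import hashlib

BITS = 64  # fingerprint width assumed for this sketch


def simhash(text):
    """Compute a 64-bit simhash fingerprint from whitespace-separated tokens."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Texts that share most of their tokens tend to produce fingerprints with a small Hamming distance, so a near-duplicate check becomes "is the distance to any recent fingerprint below some threshold". The threshold has to be tuned on real data.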
