I am currently working on a streaming API that generates a lot of text content. As expected, the API produces many duplicates, and we also have a business requirement to filter out near-duplicate data.
I did a little research on duplicate detection in data streams and read about Stable Bloom filters. Stable Bloom filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate.
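For reference, here is my rough understanding of how such a filter (a Stable Bloom filter) works, as a minimal Python sketch. The class name, the parameter values, and the use of salted SHA-1 for the hash functions are my own assumptions for illustration, and I decrement randomly chosen cells one at a time, which is a simplification. It only catches exact duplicates, which is why it does not cover my near-duplicate requirement on its own.

```python
import hashlib
import random


class StableBloomFilter:
    """Illustrative sketch; names and default parameters are hypothetical."""

    def __init__(self, num_cells=10_000, num_hashes=3, max_value=3,
                 decrements=10, seed=0):
        self.cells = [0] * num_cells      # small counters instead of single bits
        self.num_hashes = num_hashes      # cells touched per item
        self.max_value = max_value        # value a cell is reset to on insert
        self.decrements = decrements      # cells decayed per insert
        self.rng = random.Random(seed)

    def _indexes(self, item):
        # Assumption: derive several hash functions by salting SHA-1.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.cells)

    def seen_then_add(self, item):
        """Return True if item looks like a duplicate, then record it."""
        indexes = list(self._indexes(item))
        seen = all(self.cells[i] > 0 for i in indexes)
        # Decay: decrement random cells so old items are gradually forgotten.
        for _ in range(self.decrements):
            j = self.rng.randrange(len(self.cells))
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Insert: set this item's cells to the maximum value.
        for i in indexes:
            self.cells[i] = self.max_value
        return seen


sbf = StableBloomFilter()
first = sbf.seen_then_add("hello world")    # not seen before
second = sbf.seen_then_add("hello world")   # flagged as a duplicate
```

Because cells decay over time, the filter can also forget old items (false negatives), which is the trade-off it makes to stay bounded in memory on an unbounded stream.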
But I want to identify near duplicates, so I also looked at hashing techniques such as LSH (locality-sensitive hashing) and MinHash, which are used for nearest-neighbor search and near-duplicate detection.
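To make the MinHash idea concrete, this is a small Python sketch of how I understand it: each text is broken into character shingles, a signature is built from the minimum hash value under several hash functions, and the fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets. The function names, the shingle length of 5, the 64 hash functions, and the salted SHA-1 hashing are assumptions I picked for illustration, not a tuned setup.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-character shingles of a normalised text."""
    text = " ".join(text.lower().split())   # lowercase, collapse whitespace
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, token):
    # Assumption: simulate a family of hash functions by salting SHA-1.
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_perm=64):
    """One minimum hash value per simulated hash function."""
    return [min(_hash(seed, s) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "The quick brown fox jumps over the lazy dog"
doc_b = "the quick  brown fox jumps over the lazy dog"   # same after normalising
doc_c = "completely unrelated sentence about databases"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
sig_c = minhash_signature(shingles(doc_c))

sim_ab = estimated_jaccard(sig_a, sig_b)
sim_ac = estimated_jaccard(sig_a, sig_c)
```

On a stream, comparing every new signature against all earlier ones would be too slow, and as far as I understand this is where LSH comes in: the signature is split into bands, and only items that share at least one identical band are compared as candidates.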
I am somewhat stuck and am looking for pointers on how to proceed. Are there any papers or implementations that I could look at?