How to detect duplicate text with some fuzziness

Question

I am writing a small script using Text::DeDupe to remove duplicate blog posts before I have to look at them.

After reading "Syntactic Clustering of the Web", the paper on which that implementation is based, I would also like to be able to find overlapping documents (for example, a blog post that quotes only a fragment of another post rather than its full text).
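To make the "overlapping documents" case concrete, here is a minimal sketch of the two shingle-based measures from that paper: resemblance (how alike two documents are overall) and containment (how much of one document appears inside the other). It is written in Python for brevity. The function names, the word-level tokenizer, and the shingle width `w=4` are my own assumptions, not the Text::DeDupe API, and it compares exact shingle sets rather than the sampled sketches the paper uses to scale up.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=4):
    """Fraction of a's shingles that also occur in b: |A & B| / |A|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0
```

A quoted fragment scores low on resemblance against the full post but high on containment, so `containment(fragment, full_post)` is the measure to threshold when looking for excerpts. The threshold itself would have to be tuned on real data.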

Do you know of any other implementation in C, C++ or Perl that I can try before writing my own?

diff text duplicates

dpavlin Oct 24 '08 at 15:46

1 answer

dpavlin · Accepted Answer · 2010-04-26T17:44:36+0000

SpotSigs.

Source on GitHub:

http://github.com/jzawodn/perl-text-spotsig
