On another note, I am writing a small script using Text::DeDupe to remove duplicate blog posts before I have to look at them.
After reading "Syntactic Clustering of Web Pages", the paper on which the implementation is based, I would also like to be able to find overlapping documents (for example, a post that quotes only a fragment of another blog post rather than its full text).
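To make concrete what I mean by "overlapping", here is a minimal sketch of the two measures the paper defines over word shingles: resemblance (shared shingles over the union) and containment (the share of one document's shingles found in the other). It is in Python purely for illustration, it compares exact shingle sets rather than the sampled sketches the paper uses for scale, and the function names and the default `w=4` are my own assumptions, not anything taken from Text::DeDupe.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word shingles in text.

    Tokenization here is a plain lowercase whitespace split (an assumption;
    a real tool would likely strip punctuation and markup first).
    """
    words = text.lower().split()
    if len(words) < w:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=4):
    """Fraction of a's shingles that also occur in b: |A & B| / |A|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0
```

A quoted fragment would score high on `containment(fragment, full_post)` even when `resemblance` between the two is low, which is the case I care about.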
Do you know of any other implementations in C, C++, or Perl that I could try before writing my own?