Algorithm to detect duplicate / similar strings in a body of data (for example, email subject lines) in Python

Question


I am downloading a long list of my email subject lines, with the goal of finding mailing lists that I joined years ago and purging their messages from my Gmail account (which is getting pretty slow).

I am thinking specifically of newsletters, which often come from the same address and repeat the name of the product / service / group in the subject.

I know that I could search / sort by the number of messages from a particular email address (and I intend to), but I would like to correlate that data with repeating subject lines.

Many subject lines will fail an exact string match, but “Google Friends: our latest news” and “Google Friends: what we do today” are more similar to each other than to a random subject line, as are “Virgin Airlines has a great sale today” and “Take a flight with Virgin Airlines”.

So, how can I start automatically extracting trends / examples of strings that are likely to be similar?

The approaches that I considered and discarded ("because there must be some better way"):

Extracting all possible substrings, ordering them by how often they appear, and manually selecting the relevant ones
Stripping off the first word or two and then counting the occurrences of each substring
Comparing the Levenshtein distance between entries
Some sort of string similarity index ...
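
For reference, the last idea (a similarity index) can be tried with the standard library alone. This is only a minimal sketch, assuming a word-by-word comparison with `difflib.SequenceMatcher`; the helper name `subject_similarity` is hypothetical, and the sample subjects are the ones quoted above.

```python
from difflib import SequenceMatcher


def subject_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two subjects, compared word by word."""
    # Lowercase and split on whitespace so the comparison runs over words,
    # not individual characters.
    words_a = a.lower().split()
    words_b = b.lower().split()
    # autojunk only affects long sequences; it is disabled here to keep
    # the behaviour predictable.
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


google_1 = "Google Friends: our latest news"
google_2 = "Google Friends: what we do today"
virgin = "Virgin Airlines has a great sale today"

print(subject_similarity(google_1, google_2))  # shares the two leading words
print(subject_similarity(google_1, virgin))    # shares no words
```

The ratio is 2 * (matched words) / (total words in both subjects), so subjects that share a newsletter prefix score above unrelated ones. Comparing every pair is quadratic in the number of subjects, which may be too slow for a large mailbox.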



python string email data-mining fuzzy-search

Rizwan Kassim 02 '10 0:27


BLEU

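You might try smoothed BLEU between the subject lines. BLEU is an evaluation metric from machine translation that scores how closely a system's output matches a human reference translation, based on the n-grams they share. Smoothed BLEU is calculated like ordinary BLEU, except that the n-gram match counts are smoothed, so that short texts with no match at some n-gram order do not score zero.

Smoothed BLEU compares n-grams rather than single words, so it retains some word-order information.

I don't have a pointer to a Python BLEU implementation, but NIST provides one in Perl.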
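
A minimal sketch of sentence-level BLEU with add-one smoothing, written from the general definition rather than ported from the NIST script. The function names `ngrams` and `smoothed_bleu`, the default `max_n=4`, and the choice to add one at every n-gram order are assumptions made for illustration; other smoothing variants exist.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Add-one smoothed BLEU of `candidate` against a single `reference`."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0

    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it
        # occurs in the reference.
        matches = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        # Add one to numerator and denominator so that an n-gram order
        # with no matches does not force the whole score to zero.
        log_precision += math.log((matches + 1) / (total + 1))

    # Brevity penalty: penalise candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_precision / max_n)


print(smoothed_bleu("Google Friends: our latest news",
                    "Google Friends: what we do today"))
print(smoothed_bleu("Google Friends: our latest news",
                    "Virgin Airlines has a great sale today"))
```

BLEU is not symmetric (swapping candidate and reference changes the score), so for clustering subjects it may be worth averaging the score in both directions.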


dmcer 03 '10 0:32

Alex Martelli · Accepted Answer · 2010-05-02T00:54:38+0000

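I would first turn each string into a set or multiset of words (ignoring punctuation and upper/lowercase differences). (If that is not powerful enough, a second pass could use pairs or triples of adjacent words, known as bigrams and trigrams.) The key measure of similarity between strings reduced this way is which words that are not high-frequency overall (so not the, and, etc. ;-) are common to both strings, so a simple set intersection (or multiset intersection, though for this use case plain sets, especially sets of bigrams, should work fine) is enough to measure the "commonality". A word shared by two strings should count for more the rarer it is, so the negative log of the word's frequency across the whole corpus is a good starting point for the heuristic.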
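
A minimal sketch of a word-set comparison in which shared words are weighted by rarity. The helper names `words`, `rarity_weights` and `commonality` are hypothetical, and "frequency" is assumed here to mean the fraction of subjects that contain the word; the sample subjects are the ones from the question.

```python
import math
import re
from collections import Counter


def words(subject: str) -> set:
    """Lowercase a subject and return its set of words, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", subject.lower()))


def rarity_weights(subjects):
    """Map each word to -log(fraction of subjects that contain it)."""
    doc_freq = Counter(w for s in subjects for w in words(s))
    total = len(subjects)
    return {w: -math.log(count / total) for w, count in doc_freq.items()}


def commonality(a: str, b: str, weights) -> float:
    """Sum the rarity weights of the words that both subjects share."""
    return sum(weights.get(w, 0.0) for w in words(a) & words(b))


subjects = [
    "Google Friends: our latest news",
    "Google Friends: what we do today",
    "Virgin Airlines has a great sale today",
    "Take a flight with Virgin Airlines",
]
weights = rarity_weights(subjects)
print(commonality(subjects[0], subjects[1], weights))  # shares "google", "friends"
print(commonality(subjects[0], subjects[2], weights))  # shares nothing
```

A word that appears in every subject gets a weight of zero, so very common words drop out without a hand-made stop list. Swapping `words` for a function that returns bigrams would give the second-pass variant.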
