Algorithm to detect duplicate / similar lines in a body of data (for example, email subjects) in Python

I'm downloading a long list of my email subject lines, with the goal of finding mailing lists that I joined many years ago and would like to purge from my Gmail account (which is getting pretty slow).

I'm thinking specifically of newsletters, which often come from the same address and repeat the name of the product / service / group in the subject.

I know that I could search / sort by the number of messages from a specific email address (and I intend to), but I would like to correlate that data with repeating subject lines ....

Now, many subject lines will fail an exact string match, but "Google Friends: our latest news" and "Google Friends: what we do today" are more similar to each other than to a random subject line, as are "Virgin Airlines has a great sale today" and "Take a flight with Virgin Airlines".
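To make "more similar" concrete, here is a minimal sketch using `difflib.SequenceMatcher` from the standard library on lowercased word lists. The helper name `word_similarity` and the choice to compare words rather than characters are my own assumptions for illustration, not something settled in this question.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score based on shared runs of words."""
    words_a = a.lower().split()
    words_b = b.lower().split()
    # ratio() is 2 * (number of matched words) / (total words in both lists)
    return SequenceMatcher(None, words_a, words_b).ratio()


subjects = [
    "Google Friends: our latest news",
    "Google Friends: what we do today",
    "Virgin Airlines has a great sale today",
    "Take a flight with Virgin Airlines",
]

if __name__ == "__main__":
    # Print the score of every pair so related subjects stand out.
    for i, first in enumerate(subjects):
        for second in subjects[i + 1:]:
            print(f"{word_similarity(first, second):.2f}  {first!r} / {second!r}")
```

The two "Google Friends" subjects share a two-word run and score above zero, while a "Google Friends" subject and a "Virgin Airlines" subject share no words in the same order and score 0.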

So, how can I start automatically extracting trends / examples of strings that are likely to be more similar?

The approaches that I considered and discarded ("because there must be some better way"):

  • Extracting all possible substrings, ranking them by how often they appear, and manually selecting the relevant ones
  • Stripping off the first word or two, and then counting the occurrences of each substring
  • Comparing the Levenshtein distance between entries
  • Some kind of string similarity index ...
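For reference, the substring-counting idea from the list above can be sketched at the word level instead of over every possible character substring. This is only an illustration: the function name `common_phrases`, the tokenizing regex, and the defaults (bigrams, at least two subjects) are assumptions of mine.

```python
import re
from collections import Counter


def common_phrases(subjects, n=2, min_count=2):
    """Return (phrase, count) pairs for word n-grams seen in >= min_count subjects."""
    counts = Counter()
    for subject in subjects:
        # Lowercase and keep only runs of letters, digits and apostrophes.
        words = re.findall(r"[a-z0-9']+", subject.lower())
        # Use a set so a phrase is counted once per subject line.
        phrases = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(phrases)
    frequent = [(p, c) for p, c in counts.items() if c >= min_count]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = [
        "Google Friends: our latest news",
        "Google Friends: what we do today",
        "Virgin Airlines has a great sale today",
        "Take a flight with Virgin Airlines",
    ]
    for phrase, count in common_phrases(sample):
        print(count, phrase)
```

Counting each phrase once per subject keeps a single long subject line from dominating the ranking; the surviving phrases are candidates for newsletter names that still need a manual look.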





BLEU

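You could use a smooth-BLEU score between the subject lines. BLEU is a metric from machine translation that scores how closely one piece of text matches another by counting the n-grams they share. Smooth BLEU is computed like ordinary BLEU, except that the n-gram match counts are smoothed, so that a short text with no matches at some n-gram length does not score zero.

Because Smooth-BLEU looks at n-gram matches rather than single words, it still captures word order information.

As for implementations, besides Python BLEU code there is a Perl script from NIST.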
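A minimal sketch of a sentence-level smoothed BLEU score, for illustration only. It assumes add-one smoothing on every n-gram precision, whitespace tokenization, and a single reference; other smoothing variants exist, and the name `smooth_bleu` is hypothetical rather than taken from any library.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smooth_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Add-one smoothed BLEU of candidate against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: an n-gram counts at most as often as in the reference.
        matches = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps the logarithm defined when matches == 0.
        log_precision_sum += math.log((matches + 1) / (total + 1))
    # Brevity penalty: candidates shorter than the reference are penalized.
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    print(smooth_bleu("Google Friends: our latest news",
                      "Google Friends: what we do today"))
    print(smooth_bleu("Google Friends: our latest news",
                      "Virgin Airlines has a great sale today"))
```

Note that the score is not symmetric (swapping candidate and reference changes the brevity penalty and the totals), so for grouping subject lines you may want to average both directions.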


Source: https://habr.com/ru/post/1743667/

