I am extracting a long list of subject lines from my email, with the goal of finding mailing lists that I joined many years ago and purging them from my Gmail account (which is getting pretty slow).
I am thinking specifically of newsletters, which often come from the same address and repeat the name of the product / service / group in the subject.
I know that I could search / sort by the number of messages from a specific email address (and I intend to), but I would also like to correlate that data with repeating subject lines.
Now, many subject lines will fail an exact string match, but "Google Friends: our latest news" and "Google Friends: what we do today" are more similar to each other than to a random subject line, as are "Virgin Airlines has a great sale today" and "Take a flight with Virgin Airlines".
So, how can I start automatically extracting trends / examples of strings that may be more similar?
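To make that intuition concrete, here is a minimal sketch of one possible measure: word-level Jaccard overlap between two subject lines. The helper names `tokens` and `jaccard` are made up for illustration, and the tokenization (lowercase, ASCII alphanumeric words only) is an assumption, not a recommendation.

```python
import re

def tokens(subject):
    """Lowercase a subject line and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", subject.lower()))

def jaccard(a, b):
    """Shared distinct words divided by all distinct words in both lines."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # two empty subjects: treat as not similar
    return len(ta & tb) / len(ta | tb)

# The two "Google Friends" subjects share 2 of 9 distinct words.
same_list = jaccard("Google Friends: our latest news",
                    "Google Friends: what we do today")

# A "Google Friends" subject and a "Virgin Airlines" subject share none.
unrelated = jaccard("Google Friends: our latest news",
                    "Virgin Airlines has a great sale today")

print(same_list > unrelated)
```

Even a crude score like this ranks the two newsletter subjects above the unrelated pair, although common words such as "today" or "a" would add noise on a real mailbox.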
The approaches that I considered and discarded ("because there must be some better way"):
- Extracting all possible substrings, ordering them by how often they appear, and manually selecting the relevant ones
- Stripping off the first word or two, and then counting the occurrences of each substring
- Comparison of the Levenshtein distance between records
- Some kind of string similarity index ...
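For reference, a word-level variant of the first approach could be sketched as below: instead of every character substring, it counts word n-grams (two or three words by default), with each phrase counted at most once per subject line. The function names and the `min_n`, `max_n`, and `min_count` parameters are hypothetical choices for this sketch.

```python
import re
from collections import Counter

def word_ngrams(subject, min_n=2, max_n=3):
    """Return the set of word n-grams (min_n..max_n words) in one subject."""
    words = re.findall(r"[a-z0-9]+", subject.lower())
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.add(" ".join(words[i:i + n]))
    return grams

def common_phrases(subjects, min_count=2, min_n=2, max_n=3):
    """Count in how many subjects each phrase appears; keep the frequent ones."""
    counts = Counter()
    for subject in subjects:
        # A set per subject means a phrase is counted once per subject line.
        counts.update(word_ngrams(subject, min_n, max_n))
    return [(p, c) for p, c in counts.most_common() if c >= min_count]

subjects = [
    "Google Friends: our latest news",
    "Google Friends: what we do today",
    "Virgin Airlines has a great sale today",
    "Take a flight with Virgin Airlines",
]

for phrase, count in common_phrases(subjects):
    print(count, phrase)
```

On these four subjects the only phrases that appear in more than one line are "google friends" and "virgin airlines". This still leaves the manual selection step, and it says nothing about how well the idea scales to a full mailbox.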