What is a good grouping strategy for similar words?

Let's say I have a list of movie names with spelling errors and slight variations like this -

"Pirates of the Caribbean: The Curse of the Black Pearl" "Pirates of the carribean" "Pirates of the Caribbean: Dead Man Chest" "Pirates of the Caribbean trilogy" "Pirates of the Caribbean" "Pirates Of The Carribean" 

How do I group or find such word sets, preferably using python and / or redis?

5 answers

Have a look at "fuzzy matching." The thread linked below lists some useful tools for computing similarity between strings.

I especially love the difflib module

    >>> from difflib import get_close_matches
    >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
    ['apple', 'ape']
    >>> import keyword
    >>> get_close_matches('wheel', keyword.kwlist)
    ['while']
    >>> get_close_matches('apple', keyword.kwlist)
    []
    >>> get_close_matches('accept', keyword.kwlist)
    ['except']

https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
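Applied to the movie titles from the question, a minimal grouping sketch (my own illustration, not from the linked thread) could use get_close_matches to assign each title to the first group whose representative it is close enough to; the 0.6 cutoff is a tunable guess:

    from difflib import get_close_matches

    titles = [
        "Pirates of the Caribbean: The Curse of the Black Pearl",
        "Pirates of the carribean",
        "Pirates of the Caribbean: Dead Man Chest",
        "Pirates of the Caribbean trilogy",
        "Pirates of the Caribbean",
        "Pirates Of The Carribean",
    ]

    groups = {}  # representative title -> titles grouped under it
    for title in titles:
        # Compare case-insensitively against the representatives seen so far.
        match = get_close_matches(title.lower(),
                                  [rep.lower() for rep in groups],
                                  n=1, cutoff=0.6)
        if match:
            rep = next(r for r in groups if r.lower() == match[0])
            groups[rep].append(title)
        else:
            groups[title] = [title]

    for rep, members in groups.items():
        print(rep, "->", members)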


You may notice that similar strings share a large common subsequence, for example:

"Bla bla bLa" and "Bla bla bRa" => the common subsequence is "Bla bla ba" (note the third word)

To find that common part you can use a dynamic programming algorithm. A closely related measure is the Levenshtein distance (the distance between similar strings is small, and grows as the strings diverge): http://en.wikipedia.org/wiki/Levenshtein_distance
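For reference, a textbook dynamic-programming implementation of the Levenshtein distance (a generic sketch, not taken from any particular library) looks like this:

    def levenshtein(a, b):
        """Edit distance between strings a and b (insert/delete/substitute)."""
        # previous[j] holds the distance between a[:i-1] and b[:j].
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[-1]

    print(levenshtein("carribean", "caribbean"))  # 2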

You could also try adapting the Soundex algorithm (http://en.wikipedia.org/wiki/Soundex) for a fast, rough phonetic comparison.

So, after calculating the distances between all your strings, you need to cluster them. The simplest approach is k-means, but it requires you to specify the number of clusters up front. If you do not know the number of clusters in advance, use hierarchical clustering instead. Note that in your situation the number of clusters is the number of distinct movie titles plus one (for completely garbled strings).
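As a rough illustration of that pipeline, here is a sketch of single-linkage hierarchical clustering over a pairwise distance matrix; it assumes SciPy is installed, uses difflib's similarity ratio as the distance, and the 0.4 cut threshold is a guess you would tune:

    import numpy as np
    from difflib import SequenceMatcher
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    titles = ["Pirates of the Caribbean", "Pirates Of The Carribean",
              "Pirates of the Caribbean: Dead Man Chest", "Some Other Movie"]

    def distance(a, b):
        # Turn difflib's similarity ratio (1.0 = identical) into a distance.
        return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

    n = len(titles)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = distance(titles[i], titles[j])

    # Single-linkage hierarchical clustering, cut at the distance threshold.
    labels = fcluster(linkage(squareform(dist), method='single'),
                      t=0.4, criterion='distance')
    for label, title in sorted(zip(labels, titles)):
        print(label, title)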


To add to Fredrik's answer, you could also take inspiration from search-engine code like this:

    import re

    # BASE_DIR, remove_tags, and dofind are defined in the linked article.
    def dosearch(terms, searchtype, case, adddir, files=[]):
        found = []
        if files is not None:
            titlesrch = re.compile('<title>.*</title>')
            for file in files:
                title = ""
                if not (file.lower().endswith("html") or file.lower().endswith("htm")):
                    continue
                filecontents = open(BASE_DIR + adddir + file, 'r').read()
                titletmp = titlesrch.search(filecontents)
                if titletmp is not None:
                    # Slice out the text between <title> and </title> (7 and 8 chars).
                    title = filecontents[titletmp.start() + 7:titletmp.end() - 8].strip()
                filecontents = remove_tags(filecontents).strip()
                if dofind(filecontents, case, searchtype, terms) > 0:
                    found.append(title)
                    found.append(file)
        return found

Source and additional information: http://www.zackgrossbart.com/hackito/search-engine-python/

Regards,

Max


I believe that there are actually two different problems.

The first is spelling correction. Here is a well-known Python treatment of it:

http://norvig.com/spell-correct.html

The second is more semantic. Here is what I would do after spelling correction: I would define a relatedness relation,

related(sentence1, sentence2) if and only if sentence1 and sentence2 share rare words. By rare, I mean words other than stop words (the, what, is, etc.). You can look at the TF-IDF scheme to determine whether two documents are related through the words they contain. After a quick search, I found this:

https://code.google.com/p/tfidf/
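As a rough sketch of that idea (using scikit-learn rather than the linked project, which is a standalone TF-IDF implementation), you could call two titles related when the cosine similarity of their TF-IDF vectors passes a threshold; the 0.3 threshold is a guess:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    titles = ["Pirates of the Caribbean",
              "Pirates of the Caribbean trilogy",
              "The Curse of the Black Pearl",
              "An Unrelated Film"]

    # TF-IDF downweights common words, so shared rare words dominate the score.
    vectors = TfidfVectorizer(stop_words='english').fit_transform(titles)
    similarity = cosine_similarity(vectors)

    threshold = 0.3  # tunable; this defines the "related" relation
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if similarity[i, j] > threshold:
                print(titles[i], "<->", titles[j], round(similarity[i, j], 2))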


One approach is to preprocess all the strings before matching them: convert everything to lowercase and standardize the whitespace (for example, collapse any run of whitespace into a single space). If punctuation is not important for your end goal, you can also remove all punctuation marks.
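A minimal preprocessing pass along those lines (the helper name is my own) might look like:

    import re
    import string

    def normalize(title):
        """Lowercase, drop punctuation, and collapse runs of whitespace."""
        title = title.lower()
        title = title.translate(str.maketrans('', '', string.punctuation))
        return re.sub(r'\s+', ' ', title).strip()

    print(normalize("Pirates of the Caribbean:  The Curse of the Black Pearl"))
    # -> pirates of the caribbean the curse of the black pearl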

Levenshtein distance is commonly used to measure string similarity; it should help you group strings that differ only by small spelling errors.

