Detecting Similar-Sounding Words in Ruby

I know about SOUNDEX and (Double) Metaphone, but these do not let me check the similarity of words as a whole. For example, "Hello" sounds very similar to "Bye", but both of these methods will mark them as completely different.

Are there any Ruby libraries, or any methods you know of, that can determine the similarity between two words? (Either boolean, similar / not similar, or numeric, e.g. 40% similar.)

edit: Extra bonus points if there is an easy way to "plug in" another dialect or language!

+4
2 answers

I think you are describing the Levenshtein distance. And yes, there are gems for this. If you want pure Ruby, go for the text gem.

$ gem install text 

The docs have more details, but here's the gist:

 require 'text'
 Text::Levenshtein.distance('test', 'test') # => 0
 Text::Levenshtein.distance('test', 'tent') # => 1

If you are OK with native extensions...

 $ gem install levenshtein 

Its usage is similar, and its performance is very good. (It handles ~1000 spelling corrections per minute on my systems.)

If you need to know how similar two words are, use the distance divided by the word length.
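As a rough sketch of that normalization: the helper below divides the edit distance by the length of the longer word and subtracts the result from 1, giving a score between 0.0 and 1.0. The names `levenshtein` and `similarity` are illustrative and not part of either gem; the pure-Ruby distance function only stands in for `Text::Levenshtein.distance` so the snippet runs without dependencies.

```ruby
# Classic dynamic-programming Levenshtein distance (a stand-in for
# Text::Levenshtein.distance so this runs without any gem installed).
def levenshtein(a, b)
  a_chars = a.chars
  b_chars = b.chars
  # prev[j] = distance between the prefix of a processed so far
  # and the first j characters of b.
  prev = (0..b_chars.length).to_a
  a_chars.each_with_index do |ca, i|
    curr = [i + 1]
    b_chars.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: 1.0 means identical, 0.0 means nothing in common.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end

similarity('test', 'tent') # => 0.75
```

Dividing by the longer word keeps the score in the 0 to 1 range; other normalizations (for example, by the average length) are possible and would give different numbers.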

If you want a simple similarity test, consider something like this (untested, but straightforward):

 require 'text'

 String.module_eval do
   # True when the edit distance to `other` is at most `threshold`.
   def similar?(other, threshold = 2)
     distance = Text::Levenshtein.distance(self, other)
     distance <= threshold
   end
 end
+8

First, you could pre-process the words using a thesaurus database that maps words with similar meanings to the same word. There are various thesaurus databases; unfortunately, I could not find a decent free one for English. The one I found ( http://www.gutenberg.org/etext/3202 ) does not show what kind of relationship the words have (for example similar, opposite, alternative meaning, etc.), so all the words on the same line are related in some way, but you will not know what that relationship is.

For Hungarian, for example, there is a good thesaurus database, but there is no Soundex / Metaphone for Hungarian text...

If you have the database, writing a program that pre-processes the texts is not too difficult (in the end it is a simple search and replace, but you might also want to pre-process the thesaurus database using Soundex or Metaphone).
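A minimal sketch of that search-and-replace step, assuming a thesaurus where each line is a comma-separated group of related words. That format is an assumption for illustration, as are the helper names `build_canonical_map` and `canonicalize`. Every word in a group is mapped to the group's first word, so related words compare as equal afterwards.

```ruby
# Build a lookup from every word in a group to the group's first word.
# `lines` is assumed to be an array of strings like "big,large,huge".
def build_canonical_map(lines)
  map = {}
  lines.each do |line|
    words = line.strip.downcase.split(',').map(&:strip).reject(&:empty?)
    next if words.empty?
    canonical = words.first
    # Keep the first mapping seen if a word appears in several groups.
    words.each { |w| map[w] ||= canonical }
  end
  map
end

# Replace each word of the text with its canonical form, if one is known.
def canonicalize(text, map)
  text.downcase.gsub(/[[:alpha:]]+/) { |w| map.fetch(w, w) }
end

thesaurus = ["big,large,huge", "small,tiny,little"]
map = build_canonical_map(thesaurus)
canonicalize("A huge dog and a tiny cat", map) # => "a big dog and a small cat"
```

The canonicalized text could then be passed to a phonetic or edit-distance comparison as a second step.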

-1

Source: https://habr.com/ru/post/1305331/

