How can I adapt Levenshtein distance to classify linguistically similar words (e.g. verb tenses, comparative adjectives, singular and plural forms)?

I have no idea how to accomplish this task. I am counting word frequencies by the base form of each word (for example, "running" is counted as "run"). I have looked at some implementations of Levenshtein distance (one I came across is from dotnetperls).

I also tried Double Metaphone, but that is not what I am looking for.

So, please give me some ideas on how to tune the Levenshtein distance algorithm to classify linguistically similar words, since the algorithm only counts the number of edits between two strings and does not consider whether the words are linguistically similar.

Examples: 1. “running” should count as an occurrence of “run”; 2. “Word” should also count as an occurrence of “word”; 3. “Fear” should NOT count as an occurrence of “equipment”.
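To make the problem concrete, here is a minimal sketch of the standard dynamic-programming Levenshtein distance, written in Python for brevity (the same loops port directly to C#). The function name `levenshtein` and the pair "fear"/"gear" are illustrative choices, not taken from the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("running", "run"))  # 4
print(levenshtein("Word", "word"))    # 1 (the comparison is case-sensitive)
print(levenshtein("fear", "gear"))    # 1
```

The related pair "running"/"run" is four edits apart, while the unrelated pair "fear"/"gear" is only one edit apart, so a distance threshold alone cannot separate the two cases.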

In addition, I am implementing this in C#.

Thanks in advance.

Edit: I edited the question as Rene suggested. Another note: I am considering checking whether one word is a substring of another, but that approach would not be very flexible. Another idea I had: "if adding -s or -ing to string1 gives string1 == string2, then string2 is an occurrence of string1." However, this does not always hold, as some words have irregular plurals.
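A small sketch of that suffix rule shows where it breaks. It is written in Python for brevity; the helper name `is_suffixed_form` and the test words are hypothetical, and only the two suffixes named above are checked.

```python
SUFFIXES = ("s", "ing")  # the two suffixes mentioned in the edit above


def is_suffixed_form(base: str, candidate: str) -> bool:
    """True if candidate is exactly base plus one of the known suffixes."""
    return any(base + suffix == candidate for suffix in SUFFIXES)


print(is_suffixed_form("word", "words"))   # True
print(is_suffixed_form("run", "running"))  # False: the final "n" is doubled
print(is_suffixed_form("mouse", "mice"))   # False: irregular plural
```

Even the regular form "running" is missed, because English spelling changes the stem before the suffix is attached.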

1 answer

The task you are trying to solve is called stemming or lemmatisation.

As you have already found out, Levenshtein distance is not the right tool here. Common algorithms for English include the Porter and Snowball stemmers. If you search Google for them, I am sure you will find a C# implementation of one of them.
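To illustrate the idea only, here is a toy suffix-stripping sketch in Python that counts word frequencies by base form. It is not the Porter or Snowball algorithm: the rules, the `IRREGULAR` table, and the names `toy_stem` and `base_form_frequencies` are all assumptions made for this example, and a real stemmer handles far more cases.

```python
from collections import Counter

# Hypothetical exception table for irregular forms; a real one is much larger.
IRREGULAR = {"mice": "mouse", "ran": "run"}


def toy_stem(word: str) -> str:
    """Reduce a word to a rough base form using a few hand-written rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]
        # Undo consonant doubling ("runn" -> "run") but keep "fall", "miss".
        if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeioulsz":
            w = w[:-1]
        return w
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def base_form_frequencies(words):
    """Count occurrences keyed by the stemmed form of each word."""
    return Counter(toy_stem(w) for w in words)


counts = base_form_frequencies(
    ["run", "running", "Word", "words", "fear", "gear", "mice"]
)
print(counts["run"], counts["word"], counts["fear"], counts["gear"])  # 2 2 1 1
```

Words are grouped by their stem rather than by edit distance, so "running" joins "run" while "fear" and "gear" stay separate. The toy rules still over-stem some words (for example "string" becomes "str"), which is the kind of case an established stemmer is designed to handle.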


Source: https://habr.com/ru/post/1389788/

