Stemming / lemmatization of diminutive words

I am currently using Lucene and Elasticsearch and have the following problem. I need to get the base form (lemma) of a diminutive word. For instance:

  • dog → dog
  • kitty → cat

and so on.

But I get the following results:

  • dog → dog
  • kitty → kitti

Is there any way to get the root / original word form for diminutive word forms? It does not have to be a library; any algorithm or approach will do.

Target language: Russian. For instance:

  • dog → dog
  • cat → cat

Thanks in advance!

1 answer

First, as a side note: what you are trying to do is not usually called stemming or lemmatization.

Your first problem is mapping the observed token (e.g., kitty) to its normalized form (e.g., cat). Naively, you could do this by creating a SynonymFilter that uses a SynonymMap to map diminutive forms to their canonical forms. However, you are likely to run into problems with any natural language, because not all derivations are unambiguous: for example, in German, Mädel ('girl' / 'lass') can be a diminutive form of Magd (an archaic word meaning 'young woman' / 'maid') or of Made ('maggot').
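The dictionary-mapping idea can be sketched without any library. The following is a minimal Python illustration, not Lucene's SynonymMap API: the DIMINUTIVES table and the normalize function are hypothetical names, and the entries are hand-picked from the examples in this thread (plus "doggy", which is added only for illustration). Tokens with more than one candidate are left untouched, which is exactly where the naive approach stops working.

```python
# Hypothetical hand-built dictionary: diminutive -> candidate canonical forms.
# A real table would have to be compiled for the target language.
DIMINUTIVES = {
    "kitty": ["cat"],
    "doggy": ["dog"],
    "mädel": ["magd", "made"],  # ambiguous, as in the German example
}


def normalize(tokens):
    """Replace each diminutive that has exactly one candidate; keep the rest."""
    result = []
    for token in tokens:
        candidates = DIMINUTIVES.get(token.lower(), [])
        if len(candidates) == 1:
            result.append(candidates[0])
        else:
            # Unknown or ambiguous token: leave it as it is.
            result.append(token)
    return result
```

Calling normalize(["dog", "kitty"]) maps only the second token, while an ambiguous token such as "Mädel" passes through unchanged and needs a separate disambiguation step.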

One way to resolve the ambiguity between these two forms is to compute the probability of each canonical form occurring in the given context (for example, given the history of the previous n tokens) and then replace the ambiguous form with the most probable canonical form, using a custom TokenFilter. See, for example, the Wikipedia entry on word-sense disambiguation for different approaches.
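As a rough sketch of the context-probability idea, the Python snippet below scores each candidate with an add-one smoothed bigram estimate, i.e. it looks at only one previous token (n = 1). Everything here is an assumption made for illustration: the three-sentence toy corpus is invented, the names score and disambiguate are hypothetical, and a usable model would need a large corpus and probably a longer history. It is not a Lucene TokenFilter, only the selection logic such a filter could call.

```python
from collections import Counter

# Invented toy corpus of already-canonical German text (illustration only).
CORPUS = (
    "die junge magd singt . eine junge magd tanzt . eine fette made kriecht"
).split()

BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))  # counts of (previous, current)
UNIGRAMS = Counter(CORPUS)                  # counts of single tokens
VOCAB_SIZE = len(UNIGRAMS)


def score(previous, candidate):
    """Add-one smoothed estimate of P(candidate | previous)."""
    return (BIGRAMS[(previous, candidate)] + 1) / (UNIGRAMS[previous] + VOCAB_SIZE)


def disambiguate(previous, candidates):
    """Pick the candidate that is most probable after the previous token."""
    return max(candidates, key=lambda c: score(previous, c))
```

With this toy data, disambiguate("junge", ["magd", "made"]) prefers "magd" and disambiguate("fette", ["magd", "made"]) prefers "made". When the context gives no evidence, max simply returns the first candidate, so a real implementation would want an explicit fallback, such as keeping the original token.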


Source: https://habr.com/ru/post/975023/

