Diminutive words stemming / lemmatization

Question

I am currently using Lucene and Elasticsearch and have the following problem. I need to get the base form (lemma) of diminutive words. For instance:

doggy → dog
kitty → cat

and so on.

But I get the following results:

doggy → doggi
kitty → kitti

Is there any way to get the root / original word form for diminutive word forms? It does not have to be a library; any algorithm or approach will do.

Target language: Russian. For instance:

dog → dog
cat → cat

Thanks in advance!

+6

java nlp elasticsearch lucene morphological-analysis

IvanKurchenko Sep 09 '14 at 9:33


1 answer

errantlinguist · Accepted Answer · 2014-12-04T13:05:31+0000

First, as a side note: what you are trying to do is usually not called stemming or lemmatization.

Your first problem is mapping an observed token (e.g., kitty) to its normalized form (e.g., cat). Naively, you can do this by creating a SynonymFilter that uses a SynonymMap to map diminutive forms to their canonical forms. However, you are likely to run into problems with any natural language, because not every derivation is unambiguous: for example, in German, Mädel ('girl' / 'lass') can be a diminutive form of Magd (an archaic word meaning 'young woman' / 'maid') or of Made ('maggot').
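To make the dictionary idea concrete, here is a minimal sketch in plain Java (9 or later). It deliberately avoids the Lucene API: in Lucene the same table would be loaded into a SynonymMap and applied by a SynonymFilter. The class name DiminutiveDictionary, the method candidates, and the tiny word list are hypothetical and only illustrate the lookup, including the ambiguous German case.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: a plain lookup table from
// diminutive forms to their possible canonical forms.
final class DiminutiveDictionary {

    // Illustrative entries only; a real dictionary would be far larger.
    // "m\u00e4del" is the German word Maedel written with a Unicode escape
    // so that the source file stays ASCII-only.
    private static final Map<String, List<String>> CANONICAL_FORMS = Map.of(
            "kitty", List.of("cat"),
            "m\u00e4del", List.of("magd", "made"));

    // Returns every candidate canonical form for the token, or the token
    // itself (unchanged) if it is not a known diminutive.
    static List<String> candidates(String token) {
        String key = token.toLowerCase(Locale.ROOT);
        return CANONICAL_FORMS.getOrDefault(key, List.of(token));
    }

    public static void main(String[] args) {
        System.out.println(candidates("kitty"));       // prints [cat]
        System.out.println(candidates("M\u00e4del"));  // prints [magd, made]
        System.out.println(candidates("dog"));         // prints [dog]
    }
}
```

A token with more than one candidate, such as the German example, is exactly the case where a plain mapping is not enough and some form of disambiguation is needed.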

One way to disambiguate these two forms would be to calculate the probability of each canonical form appearing in the given context (for example, the history of the previous n tokens), and then replace the diminutive form with the most probable canonical form (using a custom TokenFilter). See, for example, the Wikipedia entry on word-sense disambiguation for different approaches.
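The context-based choice could be sketched roughly as follows, again in plain Java (9 or later) rather than as an actual Lucene TokenFilter. The sketch assumes a history of just one previous token (n = 1) and uses invented co-occurrence counts; the class ContextDisambiguator, the method pick, and all numbers are hypothetical. In practice the counts would be estimated from a corpus.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: choose the canonical form that was seen most often
// after the previous token. Not a Lucene class.
final class ContextDisambiguator {

    // Invented counts of (previous token -> canonical form) pairs, standing
    // in for statistics that would be gathered from a real corpus.
    private static final Map<String, Map<String, Integer>> COUNTS = Map.of(
            "junges", Map.of("magd", 8, "made", 1),
            "fettes", Map.of("magd", 1, "made", 6));

    // Picks the most probable candidate given the previous token.
    // All candidates share the same context, so comparing add-one smoothed
    // counts is equivalent to comparing smoothed conditional probabilities.
    // Ties go to the earliest candidate; candidates must not be empty.
    static String pick(String previousToken, List<String> candidates) {
        Map<String, Integer> seen = COUNTS.getOrDefault(previousToken, Map.of());
        String best = candidates.get(0);
        int bestScore = -1;
        for (String candidate : candidates) {
            int score = seen.getOrDefault(candidate, 0) + 1; // add-one smoothing
            if (score > bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("magd", "made");
        System.out.println(pick("junges", candidates)); // prints magd
        System.out.println(pick("fettes", candidates)); // prints made
    }
}
```

A longer history (larger n) or a proper language model would follow the same pattern: score each candidate in its context and keep the highest-scoring one.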
