Reverse stemming process

I am using the Lucene snowball analyzer to perform stemming. The results are not meaningful words. I referred to this question.

One solution is to use a database containing a map between the stemmed version of a word and one stable version of that word. (For example, from "communiti" to "community", no matter which word "communiti" was derived from: "communities" or some other word.)

I want to know if there is a database that performs such a function.

+6
3 answers

It is theoretically impossible to recover a specific word from a stem, since one stem can be shared by many words. One possibility, depending on your application, would be to build a database of stems, each mapped to an array of several words. But then you would need to predict which of those words is the appropriate one when converting a given stem back.
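The stem-to-words database described above could be sketched roughly as follows. This is only an illustration: the class `StemIndex` and the `toyStem` function are hypothetical names, and `toyStem` is a crude stand-in for a real stemmer such as Lucene's snowball stemmer, not its actual behavior.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical reverse index: stem -> every known word that reduces to it.
class StemIndex {
    private final Map<String, SortedSet<String>> wordsByStem = new HashMap<>();
    private final Function<String, String> stemmer;

    public StemIndex(Function<String, String> stemmer) {
        this.stemmer = stemmer;
    }

    // Stem the word and record it under that stem.
    public void add(String word) {
        wordsByStem.computeIfAbsent(stemmer.apply(word), k -> new TreeSet<>()).add(word);
    }

    // All recorded words sharing the given stem (empty if the stem is unknown).
    public SortedSet<String> candidates(String stem) {
        return wordsByStem.getOrDefault(stem, Collections.emptySortedSet());
    }

    // Assumption: a toy stemmer for the demo only; plug in a real stemmer instead.
    public static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ies")) return w.substring(0, w.length() - 3) + "i";
        if (w.endsWith("y")) return w.substring(0, w.length() - 1) + "i";
        if (w.endsWith("s")) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        StemIndex index = new StemIndex(StemIndex::toyStem);
        index.add("community");
        index.add("communities");
        // prints "[communities, community]"
        System.out.println(index.candidates("communiti"));
    }
}
```

The index only returns the candidate words; choosing one of them still requires some extra signal, which is exactly the prediction problem mentioned above.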

As a very naive solution to this problem, if you know the part-of-speech tags of the words, you can try storing the words together with their tags in your database:

    run:
      NN: runner
      VBG: running
      VBZ: runs

Then, given the stem "run" and the tag "NN", you can determine that "runner" is the most likely word in that context. Of course, this solution is far from perfect. Notably, you would need to handle the fact that the same word form can be tagged differently in different contexts. But remember that any attempt to solve this problem will be, at best, an approximation.
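A minimal sketch of the tagged lookup described above, assuming an in-memory nested map in place of a real database. The class name `TaggedStemTable` and its methods are hypothetical, and the entries are just the ones from the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical table: stem -> (part-of-speech tag -> surface word).
class TaggedStemTable {
    private final Map<String, Map<String, String>> table = new HashMap<>();

    // Record which word a stem should expand to for a given tag.
    public void put(String stem, String tag, String word) {
        table.computeIfAbsent(stem, k -> new HashMap<>()).put(tag, word);
    }

    // Return the stored word for (stem, tag), or empty if either is unknown.
    public Optional<String> lookup(String stem, String tag) {
        Map<String, String> byTag = table.get(stem);
        if (byTag == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(byTag.get(tag));
    }

    public static void main(String[] args) {
        TaggedStemTable t = new TaggedStemTable();
        t.put("run", "NN", "runner");
        t.put("run", "VBG", "running");
        t.put("run", "VBZ", "runs");
        // prints "runner"
        System.out.println(t.lookup("run", "NN").orElse("?"));
    }
}
```

Storing a single word per (stem, tag) pair is a simplification: several nouns can share one stem, so a fuller version would keep a list of candidates and rank them.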

Edit: from the comments below, it looks like you probably want to use lemmatization instead of stemming. Here is how to get word lemmas using the Stanford CoreNLP tools:

    import java.util.*;
    import edu.stanford.nlp.pipeline.*;
    import edu.stanford.nlp.ling.*;
    import edu.stanford.nlp.ling.CoreAnnotations.*;
    import edu.stanford.nlp.util.CoreMap;

    public class LemmaExample {
        public static void main(String[] args) {
            // Tokenize, split sentences, tag parts of speech, then lemmatize.
            Properties props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);

            String text = "Hello, world!";
            Annotation document = pipeline.process(text);

            for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                    String word = token.get(TextAnnotation.class);
                    String lemma = token.get(LemmaAnnotation.class);
                    System.out.println(word + " -> " + lemma);
                }
            }
        }
    }
+4

The question you refer to contains an important piece of information that is often overlooked. What you need is known as "lemmatisation" - the reduction of inflected words to their canonical form. It is related to, but different from, stemming, and it remains an open research question. It is especially difficult for languages with more complex morphology (English is not that hard). Wikipedia has a list of software you can try. Another tool I have used is TreeTagger - it is very fast and reasonably accurate, although its main purpose is part-of-speech tagging and lemmatisation is just a bonus. Try googling for "statistical lemmatization" (yes, I have strong feelings about statistical versus rule-based NLP).

+2

You can take a look at the NCI Metathesaurus - although it is mostly biomedical in nature, it offers examples of natural language processing and some open source Java toolkits whose code you may find useful to browse.

+1

Source: https://habr.com/ru/post/909541/

