Preprocessing and cleaning text in Java

Could you recommend Java libraries for text preprocessing and cleaning? The library should perform the following tasks:

  • convert all verbs to the infinitive
  • convert all nouns to the singular form
  • remove words that are useless for the text
+4
2 answers

Converting words to their canonical forms (for example, verbs to infinitives and nouns to singular) is called lemmatization. One Java lemmatizer is Stanford CoreNLP.

For "useless words," you probably want to remove stop words. There is no standard list, but many are floating around the Internet that work more or less the same; the main difference is how many words they include (usually from 100 to 1000). I have seen people use this list before. When removing stop words, remember to ignore case when matching.
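To illustrate the case-insensitive matching, here is a minimal sketch in plain Java. The class name `StopWordFilter`, the method `removeStopWords`, and the eight-word stop list are all made up for this example; a real list would be loaded from a file and be much longer. It splits on whitespace only, so punctuation attached to a word is not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {
    // Hypothetical, tiny stop list kept in lowercase; real lists usually
    // contain somewhere between 100 and 1000 words.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));

    // Splits the text on whitespace and drops every token whose lowercase
    // form is in the stop list. The surviving tokens keep their original case.
    static List<String> removeStopWords(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !STOP_WORDS.contains(word.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "The" and "THE" are both removed because matching ignores case.
        System.out.println(removeStopWords("The cat is in THE garden"));
    }
}
```

Lowercasing each token before the lookup (rather than storing every case variant in the set) keeps the stop list small and the lookup a single hash check.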

+3

Not sure if this is everything you need, but check out mrsqg.

http://code.google.com/p/mrsqg/

-1

Source: https://habr.com/ru/post/1443835/

