Remove stop words from parsed content using OpenNLP

I parsed a document using the OpenNLP parser code provided in this link, and I got the following output:

(TOP (S (NP (NN Programcreek)) (VP (VBZ is) (NP (DT a) (ADJP (RB very) (JJ huge) (CC and) (JJ useful)) (NN website))))) 

From this I want to extract only the meaningful words; that is, I want to remove all stop words, because I want to do further classification based on these significant words. Can you suggest how to remove stop words from the parsed output?

In the end, I want to get the output below:

  (TOP (S (NP (NN Programcreek)) (JJ useful)) (NN website))))) 

Please help me with this. If it is not possible with OpenNLP, please suggest another Java library for natural language processing, because my main goal is to parse the document and keep only the meaningful words.

2 answers

You can easily remove all stop words from text before passing it to OpenNLP.

  • Store stop words in an array
  • Sort the array by word length, longest first, to avoid partial matches such as "can" matching inside "can't" and leaving just "'t" behind
  • Use a regex to delete all stop words, making sure to ignore case and to delete only whole words.

Here's how to do it in .NET, which you can port to Java.

    using System.Linq;
    using System.Text.RegularExpressions;

    public string CleanStopWords(string inputText)
    {
        string[] stopWords = new string[] { "a", "all", "am", "an", "and", "any", "are", "aren't", "as", "at", "be", "because", "been", "to", "from", "by", "can", "can't", "do", "don't", "didn't", "did" };

        // Longest words first, so "can't" is tried before "can".
        stopWords = stopWords.OrderByDescending(w => w.Length).ToArray();

        // \b on both sides restricts matches to whole words; IgnoreCase also catches capitalized forms.
        string outputText = Regex.Replace(
            inputText,
            "\\b" + string.Join("\\b|\\b", stopWords) + "\\b",
            "",
            RegexOptions.IgnoreCase);
        return outputText;
    }
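A rough Java port of the same idea might look like the sketch below. The class name `StopWordCleaner`, the method `clean`, and the short stop list are made up for illustration; the sketch also quotes each word with `Pattern.quote` and collapses the leftover whitespace, which the .NET version does not do.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StopWordCleaner {
    // Illustrative stop list only (an assumption); a real list would be much longer.
    private static final String[] STOP_WORDS = {
        "a", "an", "and", "are", "can", "can't", "do", "don't", "is", "to"
    };

    // Longest words first, so "can't" is tried before "can".
    // Pattern.quote keeps any regex metacharacters in a word literal.
    private static final Pattern STOP_PATTERN = Pattern.compile(
        "\\b(?:" + Arrays.stream(STOP_WORDS)
            .sorted((a, b) -> Integer.compare(b.length(), a.length()))
            .map(Pattern::quote)
            .collect(Collectors.joining("|")) + ")\\b",
        Pattern.CASE_INSENSITIVE);

    static String clean(String inputText) {
        // Delete whole-word matches, then collapse the gaps they leave behind.
        String stripped = STOP_PATTERN.matcher(inputText).replaceAll("");
        return stripped.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("I can't do this and that")); // prints "I this that"
    }
}
```

Compiling the pattern once as a static field avoids rebuilding it for every document, which matters if the cleaner runs over a large corpus.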

OpenNLP does not seem to support this feature. You will need to do as Olena Vicar proposes and implement it yourself, or use another NLP library in Java, such as Mallet.

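The implementation in Java for removing stop words is as follows. The list does not need to be sorted here: the trailing \b forces the regex engine to backtrack to a longer alternative when a shorter one stops mid-word. This holds as long as the stop words contain only letters; words with apostrophes such as "can't" would still need the longest-first ordering.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String testText = "This is a text you want to test";
    String[] stopWords = new String[]{"a", "able", "about", "above", "according", "accordingly", "across", "actually", "after", "afterwards", "again", "against", "all"};
    // Join the words into a single alternation: a|able|about|...
    String stopWordsPattern = String.join("|", stopWords);
    // \b...\b matches whole words only; \s* also removes the whitespace that follows a match.
    Pattern pattern = Pattern.compile("\\b(?:" + stopWordsPattern + ")\\b\\s*", Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(testText);
    testText = matcher.replaceAll("");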

You can use this list of English stop words.
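If the text has already been parsed, the same kind of stop list can also be applied to the leaves of the bracketed parse string instead of the raw text. The sketch below uses only `java.util.regex`, not the OpenNLP API; the class `ParseLeafFilter`, the method `contentLeaves`, and the four-word stop list are hypothetical and exist only for this example. It assumes every leaf has the form `(TAG word)`, as in the output shown in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParseLeafFilter {
    // Illustrative stop list only (an assumption); load a full list in practice.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "and", "is", "very"));

    // A leaf is "(TAG word)": a tag, one space, and a token, none containing parentheses.
    private static final Pattern LEAF =
        Pattern.compile("\\(([^\\s()]+) ([^\\s()]+)\\)");

    static List<String> contentLeaves(String parse) {
        List<String> kept = new ArrayList<>();
        Matcher m = LEAF.matcher(parse);
        while (m.find()) {
            String word = m.group(2);
            // Keep the leaf only if its word is not in the stop list.
            if (!STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add("(" + m.group(1) + " " + word + ")");
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String parse = "(TOP (S (NP (NN Programcreek)) (VP (VBZ is) (NP (DT a) "
            + "(ADJP (RB very) (JJ huge) (CC and) (JJ useful)) (NN website)))))";
        // prints [(NN Programcreek), (JJ huge), (JJ useful), (NN website)]
        System.out.println(contentLeaves(parse));
    }
}
```

This returns a flat list of tagged words rather than a pruned tree, which is usually enough when the words feed a classifier. Filtering on the tag in `m.group(1)` (for example, dropping `DT` and `CC`) would be a variation on the same loop.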

Alternatively, to use Mallet, follow the tutorial here. Stop-word removal is handled by a pipe dedicated to that purpose:

 pipeList.add(new TokenSequenceRemoveStopwords(false, false)); 

Mallet includes a list of stop words, so you do not need to define them yourself, but the list can be extended if necessary.

Hope this helps.


Source: https://habr.com/ru/post/1492228/

