Removing stop words from a string in Java

I have a line with a lot of words, and a text file containing stop words that I need to remove from that line. Say the line is:

s="I love this phone, its super fast and there so much new and cool things with jelly bean....but of recently I've seen some bugs." 

After removing the stop words, the line should look like:

 "love phone, super fast much cool jelly bean....but recently bugs." 

I managed to achieve this, but the problem I am facing is that when there are adjacent stop words in the String, only the first one is deleted, and I get the result as:

 "love phone, super fast there much and cool with jelly bean....but recently seen bugs" 

Here is my stopwordslist.txt file: Stopwords

How can I solve this problem? Here is what I have done so far:

    int k = 0, i, j;
    ArrayList<String> wordsList = new ArrayList<String>();
    String sCurrentLine;
    String[] stopwords = new String[2000];
    try {
        FileReader fr = new FileReader("F:\\stopwordslist.txt");
        BufferedReader br = new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null) {
            stopwords[k] = sCurrentLine;
            k++;
        }
        String s = "I love this phone, its super fast and there so much new and cool things with jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words) {
            wordsList.add(word);
        }
        for (int ii = 0; ii < wordsList.size(); ii++) {
            for (int jj = 0; jj < k; jj++) {
                if (stopwords[jj].contains(wordsList.get(ii).toLowerCase())) {
                    wordsList.remove(ii);
                    break;
                }
            }
        }
        for (String str : wordsList) {
            System.out.print(str + " ");
        }
    } catch (Exception ex) {
        System.out.println(ex);
    }
10 answers

The error is that you are removing items from the list you are iterating over. Say wordsList contains |word0|word1|word2|. If ii is 1 and the if condition is true, you call wordsList.remove(1);. After that, the list is |word0|word2|. ii is then incremented to 2, which now equals the size of the list, so the loop ends and word2 is never checked.

From there, there are several solutions. For example, instead of removing values, you can set them to "", or build a separate "result" list.
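Both ideas can be sketched as follows. This is only an illustration: the class and method names (`StopWordFilter`, `removeStopWords`, `removeInPlace`) are made up, and the stop words are passed in directly instead of being read from stopwordslist.txt. It also assumes that a stop word should match a whole lower-cased token rather than a substring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the code in the question.
class StopWordFilter {

    // "Result list" variant: copy every token that is not a stop word,
    // so no index can be skipped by a removal.
    static String removeStopWords(String line, Collection<String> stopWords) {
        Set<String> stopSet = new HashSet<>();
        for (String w : stopWords) {
            stopSet.add(w.toLowerCase(Locale.ROOT));
        }
        List<String> kept = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty() && !stopSet.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    // In-place variant: walking backwards means a removal only shifts
    // elements that have already been checked.
    static void removeInPlace(List<String> words, Set<String> lowerCaseStopWords) {
        for (int i = words.size() - 1; i >= 0; i--) {
            if (lowerCaseStopWords.contains(words.get(i).toLowerCase(Locale.ROOT))) {
                words.remove(i);
            }
        }
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("i", "this", "and", "there", "so");
        // prints "love phone much"
        System.out.println(removeStopWords("I love this phone and there so much", stopWords));
    }
}
```

Adjacent stop words such as "and there so" are all dropped here, because neither variant moves an unchecked word into a position the loop has already passed.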


Try the program below.

    String s = "I love this phone, its super fast and there so"
            + " much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String[] words = s.split(" ");
    ArrayList<String> wordsList = new ArrayList<String>();
    Set<String> stopWordsSet = new HashSet<String>();
    stopWordsSet.add("I");
    stopWordsSet.add("THIS");
    stopWordsSet.add("AND");
    stopWordsSet.add("THERE'S");

    for (String word : words) {
        String wordCompare = word.toUpperCase();
        if (!stopWordsSet.contains(wordCompare)) {
            wordsList.add(word);
        }
    }

    for (String str : wordsList) {
        System.out.print(str + " ");
    }

OUTPUT: love phone, its super fast there so much new cool things with jelly bean....but of recently I've seen some bugs.


This is a much more elegant solution (IMHO) using only regular expressions:

    // Instead of the ".....", add all your stop words, separated by "|".
    // "\\b" accounts for word boundaries, i.e. it does not replace "his" in "this".
    // "\\s?" suppresses the optional trailing white space.
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
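If the stop words come from a list rather than being typed into the pattern by hand, the alternation can be assembled at run time. The following is a minimal sketch under that assumption; `RegexStopWords`, `compile` and `strip` are illustrative names, and the case-insensitive flag is an addition that the pattern above does not use.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper that builds the stop-word pattern from a list.
class RegexStopWords {

    // Produces \b(?:w1|w2|...)\b\s? ; Pattern.quote keeps characters such as
    // '.' inside a stop word from being interpreted as regex syntax.
    static Pattern compile(List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?", Pattern.CASE_INSENSITIVE);
    }

    static String strip(String text, List<String> stopWords) {
        if (stopWords.isEmpty()) {
            return text; // an empty alternation would match at every word boundary
        }
        return compile(stopWords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "love phone, super fast"
        System.out.println(strip("I love this phone, its super fast", Arrays.asList("I", "this", "its")));
    }
}
```

Note that `\b` also sees a boundary before an apostrophe, so with "I" in the list the token "I've" would be reduced to "'ve".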

You can use the replaceAll method, like this:

    String yourString = "I love this phone, its super fast and there so much new and cool things with jelly bean....but of recently I've seen some bugs.";
    yourString = yourString.replaceAll("stop", "");

Here try the following:

    String s = "I love this phone, its super fast and there so much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String stopWords[] = {"love", "this", "cool"};
    for (int i = 0; i < stopWords.length; i++) {
        if (s.contains(stopWords[i])) {
            s = s.replaceAll(stopWords[i] + "\\s+", ""); // note this will remove spaces at the end
        }
    }
    System.out.println(s);

The final output will then be free of the words you do not want. Just load your stop words into the array and run the replacement on the target string.
The output for my stop words:

 I phone, its super fast and there so much new and things with jelly bean....but of recently I've seen some bugs. 

Why not use the approach below instead? It is easier to read and understand:

    for (String word : words) {
        // replaceAll, not replace: replace() would treat "\\s*" as literal text
        s = s.replaceAll(word + "\\s*", "");
    }
    System.out.println(s); // prints the string with the words removed

Try using the String replaceAll API, like this:

    String myString = "I love this phone, its super fast and there so much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String stopWords = "I|its|with|but";
    String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
    System.out.println(afterStopWords);

OUTPUT:

    love this phone, super fast and there so much new and cool things jelly bean....of recently 've seen some bugs.

Try storing your stop words in a Set and splitting the line into a List. After that, you can simply call removeAll to get the result.

    Set<String> stopwords = new HashSet<>(); // fill the set from your file
    String s = "I love this phone, its super fast and there so much new and cool things with jelly bean....but of recently I've seen some bugs.";
    // Arrays.asList alone returns a fixed-size list, so copy it into an ArrayList
    List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));
    listOfStrings.removeAll(stopwords);
    String result = StringUtils.join(listOfStrings, " "); // StringUtils is from Apache Commons Lang

No for loops are needed; they usually mean trouble.
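A fuller sketch of the same set-based idea, including the file-loading step, might look like the following. It assumes one stop word per line in a UTF-8 file; `FileStopWords`, `load` and `filter` are illustrative names, and `removeIf` is used in place of `removeAll` so that each token can be lower-cased before the lookup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: loads stop words from a file and filters a sentence.
class FileStopWords {

    // Reads one stop word per line, ignoring blank lines and case.
    static Set<String> load(Path file) throws IOException {
        Set<String> stopWords = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    // Drops every whitespace-separated token whose lower-cased form is a stop word.
    static String filter(String sentence, Set<String> stopWords) {
        List<String> words = new ArrayList<>(Arrays.asList(sentence.split("\\s+")));
        words.removeIf(w -> stopWords.contains(w.toLowerCase(Locale.ROOT)));
        return String.join(" ", words);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real stop-word list.
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, Arrays.asList("i", "this", "its", "and"), StandardCharsets.UTF_8);
        // prints "love phone, super fast new"
        System.out.println(filter("I love this phone, its super fast and new", load(file)));
        Files.delete(file);
    }
}
```

Tokens keep their punctuation, so "phone," would not match a stop word "phone" in this sketch.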


It seems that you stop as soon as one stop word has been removed and then move on to the next word: you need to remove all the stop words in each sentence.

You should try changing the code:

From:

    for (int ii = 0; ii < wordsList.size(); ii++) {
        for (int jj = 0; jj < k; jj++) {
            if (stopwords[jj].contains(wordsList.get(ii).toLowerCase())) {
                wordsList.remove(ii);
                break;
            }
        }
    }

To something like:

    for (int ii = 0; ii < wordsList.size(); ii++) {
        for (int jj = 0; jj < k; jj++) {
            // Caution: after a removal the inner loop keeps running, so
            // wordsList.get(ii) can throw if the last element was just removed.
            if (wordsList.get(ii).toLowerCase().contains(stopwords[jj])) {
                wordsList.remove(ii);
            }
        }
    }

Note that the break is removed, and stopword.contains(word) is changed to word.contains(stopword).


One of my recent projects needed to filter stop words, stemming words and swear words out of a given text or file. After going through several blogs and write-ups, I created a simple library for filtering such data/files and made it available on Maven. I hope it helps someone.

https://github.com/uttesh/exude

    <dependency>
        <groupId>com.uttesh</groupId>
        <artifactId>exude</artifactId>
        <version>0.0.2</version>
    </dependency>

Source: https://habr.com/ru/post/980243/

