I had the same problem. To remove stop words using Lucene, you can either use its default English stop set, returned by EnglishAnalyzer.getDefaultStopSet(), or create your own list of stop words.
The code below shows a corrected version of your removeStopWords():
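    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.util.Version;

    public static String removeStopWords(String textFile) throws Exception {
        // Lucene's built-in English stop word set
        CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
        // Wrap the tokenizer so that stop words are dropped from the stream
        tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);
        StringBuilder sb = new StringBuilder();
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            sb.append(charTermAttribute.toString()).append(' ');
        }
        // Finish and release the stream, as the TokenStream workflow requires
        tokenStream.end();
        tokenStream.close();
        return sb.toString().trim(); // trim() drops the trailing space
    }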
To use a custom stop word list, use the following:
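    // CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); // Lucene's default set
    final List<String> stop_Words = Arrays.asList("fox", "the");
    // The third argument (true) makes matching case-insensitive
    final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);

Then pass stopSet to the StopFilter constructor in place of stopWords.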
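To see what kind of output to expect from a custom list like this one, here is a minimal sketch of the same idea in plain Java, without Lucene. The class name StopWordSketch and its tokenization (splitting on anything that is not a letter or digit) are assumptions made for illustration; Lucene's StandardTokenizer follows different, more careful rules, so results can differ on punctuation-heavy text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; this is not part of Lucene.
class StopWordSketch {

    // Drops every token whose lower-cased form is in stopWords and joins
    // the remaining tokens with single spaces. stopWords must be lower case.
    static String removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        // Naive tokenization: split on runs of non-letter, non-digit characters.
        for (String token : text.trim().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            // Case-insensitive match, similar in spirit to ignoreCase = true
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("fox", "the"));
        System.out.println(removeStopWords("The quick brown fox jumps over the lazy dog", stopWords));
        // prints: quick brown jumps over lazy dog
    }
}
```

Both "The" and "the" are removed here because the comparison is done on the lower-cased token, which is the behaviour the `true` flag asks for in the CharArraySet constructor above.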