Tokenize and remove stop words using Lucene in Java

I am trying to tokenize and remove stop words from a .txt file with Lucene. This is what I have:

    public String removeStopWords(String string) throws IOException {
        Set<String> stopWords = new HashSet<String>();
        stopWords.add("a");
        stopWords.add("an");
        stopWords.add("I");
        stopWords.add("the");
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
        tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);
        StringBuilder sb = new StringBuilder();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(" ");
            }
            sb.append(token.toString());
            System.out.println(sb);
        }
        return sb.toString();
    }

My main thing is this:

    String file = "..../datatest.txt";
    TestFileReader fr = new TestFileReader();
    fr.imports(file);
    System.out.println(fr.content);
    String text = fr.content;
    Stopwords stopwords = new Stopwords();
    stopwords.removeStopWords(text);
    System.out.println(stopwords.removeStopWords(text));

This gives me an error, but I cannot understand why.
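(TestFileReader is not shown in the question; as a stand-in, a stdlib-only way to read the file's contents into a String — the class name, method name, and UTF-8 encoding here are assumptions for illustration, not part of the original code:)

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadTextFile {
    // Reads the whole file into a single String, assuming UTF-8.
    static String readContent(String file) throws IOException {
        return new String(Files.readAllBytes(Paths.get(file)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Create a small sample file so the example is self-contained.
        Path tmp = Files.createTempFile("datatest", ".txt");
        Files.write(tmp, "I saw an owl in the barn".getBytes(StandardCharsets.UTF_8));
        System.out.println(readContent(tmp.toString()));
    }
}
```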


I had the same problem. To remove stop words with Lucene, you can use its default English stop set via EnglishAnalyzer.getDefaultStopSet(). Otherwise, you can create your own list of stop words.

The code below shows a corrected version of your removeStopWords():

    public static String removeStopWords(String textFile) throws Exception {
        CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
        tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);
        StringBuilder sb = new StringBuilder();
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // must be called before incrementToken()
        while (tokenStream.incrementToken()) {
            String term = charTermAttribute.toString();
            sb.append(term + " ");
        }
        return sb.toString();
    }

To use a custom stop word list, use the following:

    // CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); // Lucene's default set
    final List<String> stop_Words = Arrays.asList("fox", "the");
    final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
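(If all you need is simple whitespace tokenization plus stop-word removal, the same idea works without Lucene; a minimal stdlib sketch — the class name, lower-casing, and word list are assumptions for illustration:)

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class SimpleStopFilter {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "i", "the"));

    // Splits on whitespace and drops stop words (case-insensitive match).
    static String removeStopWords(String text) {
        StringJoiner sb = new StringJoiner(" ");
        for (String token : text.trim().split("\\s+")) {
            if (!STOP_WORDS.contains(token.toLowerCase())) {
                sb.add(token);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("I saw an owl in the barn"));
        // prints: saw owl in barn
    }
}
```

Note this keeps only exact word matches; Lucene's StandardTokenizer additionally strips punctuation and handles Unicode word boundaries, which this sketch does not.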

You can try calling tokenStream.reset() before calling tokenStream.incrementToken().


Source: https://habr.com/ru/post/1491222/
