How to add custom stop words using Lucene in Java

I use Lucene to remove English stop words, but I need to remove custom stop words as well. Below is my code, which removes only the English stop words:

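    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class Stopwords_remove {

        public String removeStopWords(String string) throws IOException {
            // Tokenize the input, then drop the default English stop words.
            TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
            tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StandardAnalyzer.STOP_WORDS_SET);

            StringBuilder sb = new StringBuilder();
            CharTermAttribute token = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                if (sb.length() > 0) {
                    sb.append(" ");
                }
                sb.append(token.toString());
            }
            tokenStream.end();
            tokenStream.close();
            return sb.toString();
        }

        public static void main(String[] args) throws IOException {
            String text = "this is a java project written by james.";
            Stopwords_remove stopwords = new Stopwords_remove();
            // Print the result so the output can be compared with the expected one.
            System.out.println(stopwords.removeStopWords(text));
        }
    }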

Output: java project written james.

Required output: java project james.

How can I do this?

1 answer

You can either add your additional stop words to a copy of the standard English stop word set, or chain a second StopFilter after the first one. For example:

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));

    // Copy the default English stop set and extend the copy with custom words.
    CharArraySet stopSet = CharArraySet.copy(Version.LUCENE_36, StandardAnalyzer.STOP_WORDS_SET);
    stopSet.add("add");
    stopSet.add("your");
    stopSet.add("stop");
    stopSet.add("words");
    tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, stopSet);

    // Or, if you just need the added stop words in a StandardAnalyzer,
    // you could pass this stop set to the StandardAnalyzer constructor:
    // analyzer = new StandardAnalyzer(Version.LUCENE_36, stopSet);

or

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));

    // First filter: the default English stop words.
    tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StandardAnalyzer.STOP_WORDS_SET);

    List<String> stopWords = // your list of stop words.....

    // Second filter: the custom stop words.
    tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StopFilter.makeStopSet(Version.LUCENE_36, stopWords));
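To see that the two variants should give the same result, here is a rough sketch in plain Java with no Lucene dependency. It is only a stand-in for the filter chain, not Lucene's implementation: the class `StopWordSketch`, its helper methods, and the small default word list are all hypothetical, the tokenizer is a crude regex split, and matching is case-sensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {

    // Hypothetical stand-in for a built-in English stop set (not Lucene's actual list).
    static final Set<String> DEFAULT_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "by", "is", "the", "this"));

    // Crude tokenizer: split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // One filtering pass: keep only the tokens that are not in the stop set.
    static List<String> filter(List<String> tokens, Set<String> stopSet) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopSet.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Variant 1: copy the defaults and add the custom words to the copy.
    static Set<String> mergedStopSet(List<String> custom) {
        Set<String> merged = new HashSet<>(DEFAULT_STOP_WORDS);
        merged.addAll(custom);
        return merged;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("this is a java project written by james.");
        List<String> custom = Arrays.asList("written");

        // Variant 1: a single pass over the merged set.
        List<String> merged = filter(tokens, mergedStopSet(custom));

        // Variant 2: two chained passes, defaults first, then custom words.
        List<String> chained = filter(filter(tokens, DEFAULT_STOP_WORDS), new HashSet<>(custom));

        System.out.println(String.join(" ", merged));   // prints "java project james"
        System.out.println(String.join(" ", chained));  // prints "java project james"
    }
}
```

Both variants drop the same tokens here, so the choice is mostly about convenience: a merged set needs only one filter in the chain, while a second filter leaves the default set untouched.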

If you are trying to create your own analyzer, you may be better served by a template similar to the example in the Analyzer documentation.


Source: https://habr.com/ru/post/1494886/

