Underscores in Lucene output with stop words when getting ngram frequencies

Question


I'm currently giving the user the option to include stop words or not when filtering text for ngram frequencies. Typically, this is done as follows:

    snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
    shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, this.getnGramLength());

stopWords is set to a full list of words, either to include in the ngrams or to remove from them. this.getnGramLength() simply returns the current ngram length, up to a maximum of three.

If I use stop words (so that they stay in the ngrams) when filtering the text "satellite is definitely falling to Earth" for trigrams, the output is:

    No=1, Key=to, Freq=1
    No=2, Key=definitely, Freq=1
    No=3, Key=falling to earth, Freq=1
    No=4, Key=satellite, Freq=1
    No=5, Key=is, Freq=1
    No=6, Key=definitely falling to, Freq=1
    No=7, Key=definitely falling, Freq=1
    No=8, Key=falling, Freq=1
    No=9, Key=to earth, Freq=1
    No=10, Key=satellite is, Freq=1
    No=11, Key=is definitely, Freq=1
    No=12, Key=falling to, Freq=1
    No=13, Key=is definitely falling, Freq=1
    No=14, Key=earth, Freq=1
    No=15, Key=satellite is definitely, Freq=1

But if I do not use stop words for trigrams (so that they are removed), the output is:

    No=1, Key=satellite, Freq=1
    No=2, Key=falling _, Freq=1
    No=3, Key=satellite _ _, Freq=1
    No=4, Key=_ earth, Freq=1
    No=5, Key=falling, Freq=1
    No=6, Key=satellite _, Freq=1
    No=7, Key=_ _, Freq=1
    No=8, Key=_ falling _, Freq=1
    No=9, Key=falling _ earth, Freq=1
    No=10, Key=_, Freq=3
    No=11, Key=earth, Freq=1
    No=12, Key=_ _ falling, Freq=1
    No=13, Key=_ falling, Freq=1

Why am I seeing the underscores? I would have expected to see the simple unigrams plus "satellite falling", "falling earth" and "satellite falling earth". The stop words I am using here are "is", "definitely" and "to".

I could just filter out the results that contain underscores, but ...
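For reference, the filtering workaround mentioned above could look roughly like the sketch below. It is plain Java with no Lucene dependency; the class name `FillerKeyFilter` and the method `dropFillerKeys` are hypothetical, and the ngram counts are assumed to be held in a `Map<String, Integer>` keyed by the space-separated ngram.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: post-filters ngram counts, it does not touch Lucene.
final class FillerKeyFilter {

    // Returns a copy of counts without any key in which one of the
    // space-separated tokens is exactly "_". Tokens that merely contain
    // an underscore (e.g. "snake_case") are kept.
    static Map<String, Integer> dropFillerKeys(Map<String, Integer> counts) {
        Map<String, Integer> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            boolean hasFiller = false;
            for (String part : entry.getKey().split(" ")) {
                if (part.equals("_")) {
                    hasFiller = true;
                    break;
                }
            }
            if (!hasFiller) {
                cleaned.put(entry.getKey(), entry.getValue());
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("satellite", 1);
        counts.put("falling _", 1);
        counts.put("_", 3);
        counts.put("falling _ earth", 1);
        counts.put("earth", 1);
        System.out.println(dropFillerKeys(counts));
    }
}
```

Note the limitation: applied to the underscore output above, this leaves only the unigrams satellite, falling and earth. It discards the ngrams that span a removed stop word rather than turning them into "satellite falling" and so on.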


lucene n-gram

Mr morgan Sep 19 '12 at 8:33


1 answer

Gevorg · Accepted Answer · 2012-12-14T20:01:06+0000

The underscores stand for the missing stop word(s). To avoid this behavior, you should set enablePositionIncrements to false, but SnowballAnalyzer (now deprecated in 4.0.0-Beta) does not allow you to do this.
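The effect can be reproduced without Lucene. The following is a minimal plain-Java sketch, not Lucene's implementation: `ShingleGapDemo`, `countShingles` and the `keepGaps` flag are names made up for illustration, stemming is left out, and it assumes that each removed stop word leaves exactly one "_" placeholder when position gaps are preserved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simulation only: mimics shingling over a stream with removed stop words.
final class ShingleGapDemo {

    // Placeholder for a removed stop word, as seen in the question's output.
    static final String FILLER = "_";

    // Counts all 1..maxN-grams. With keepGaps == true every removed stop
    // word leaves a FILLER behind; with false it disappears entirely.
    static Map<String, Integer> countShingles(List<String> tokens, Set<String> stopWords,
                                              int maxN, boolean keepGaps) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            } else if (keepGaps) {
                kept.add(FILLER);
            }
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int start = 0; start + n <= kept.size(); start++) {
                String shingle = String.join(" ", kept.subList(start, start + n));
                counts.merge(shingle, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("satellite", "is", "definitely", "falling", "to", "earth");
        Set<String> stopWords = new HashSet<>(Arrays.asList("is", "definitely", "to"));
        System.out.println(countShingles(tokens, stopWords, 3, true));   // gaps preserved
        System.out.println(countShingles(tokens, stopWords, 3, false));  // gaps dropped
    }
}
```

With `keepGaps` set to true the sketch produces the same 13 keys as in the question, including "_" with a frequency of 3. With false it produces the six ngrams the asker expected: the three unigrams, "satellite falling", "falling earth" and "satellite falling earth".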

One solution is to use a StandardAnalyzer with no stop words of its own, and then decorate its output with StopFilter, SnowballFilter and ShingleFilter. Example for bigrams in Lucene 4.0.0-Beta:

    // Tokenize only; the empty set means the analyzer removes no stop words itself.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40, CharArraySet.EMPTY_SET);
    TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(input));
    // Remove stop words without leaving position gaps behind.
    StopFilter stopFilter = new StopFilter(Version.LUCENE_40, tokenStream, stopWords);
    stopFilter.setEnablePositionIncrements(false);
    // Stem the remaining tokens, then build the bigrams (min and max shingle size 2).
    SnowballFilter snowballFilter = new SnowballFilter(stopFilter, "English");
    ShingleFilter bigramShingleFilter = new ShingleFilter(snowballFilter, 2, 2);

Hope this puts you on the right track!

EDIT

This is no longer possible with Lucene v4.4+; still looking for a nice alternative ...
