I'm currently giving the user the option to include stop words or not when filtering text for n-gram frequencies. This is done as follows:
    snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
    shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, this.getnGramLength());
stopWords holds the full list of words either to include in the n-grams or to remove from them. this.getnGramLength() just returns the current n-gram length, up to a maximum of three.
If I keep the stop words when filtering the text “satellite is definitely falling to Earth” for trigrams, the output is:
No=1, Key=to, Freq=1
No=2, Key=definitely, Freq=1
No=3, Key=falling to earth, Freq=1
No=4, Key=satellite, Freq=1
No=5, Key=is, Freq=1
No=6, Key=definitely falling to, Freq=1
No=7, Key=definitely falling, Freq=1
No=8, Key=falling, Freq=1
No=9, Key=to earth, Freq=1
No=10, Key=satellite is, Freq=1
No=11, Key=is definitely, Freq=1
No=12, Key=falling to, Freq=1
No=13, Key=is definitely falling, Freq=1
No=14, Key=earth, Freq=1
No=15, Key=satellite is definitely, Freq=1
But if I filter the stop words out, the trigram output is as follows:
No=1, Key=satellite, Freq=1
No=2, Key=falling _, Freq=1
No=3, Key=satellite _ _, Freq=1
No=4, Key=_ earth, Freq=1
No=5, Key=falling, Freq=1
No=6, Key=satellite _, Freq=1
No=7, Key=_ _, Freq=1
No=8, Key=_ falling _, Freq=1
No=9, Key=falling _ earth, Freq=1
No=10, Key=_, Freq=3
No=11, Key=earth, Freq=1
No=12, Key=_ _ falling, Freq=1
No=13, Key=_ falling, Freq=1
Why do I see underscores? I would have expected to see the simple unigrams, plus "satellite falling", "falling earth" and "satellite falling earth". "is", "definitely" and "to" are the stop words that I use.
I can just filter out the results with underscores, but ...
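For reference, the workaround I mean would look roughly like the sketch below. It is plain Java with no Lucene calls, and it assumes the frequencies are already collected in a Map<String, Integer> keyed by n-gram. The class name ShingleCountFilter, the method dropFillerNGrams and the FILLER constant are hypothetical names I made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ShingleCountFilter {

    // Assumption: "_" is the placeholder that appears where a stop word was removed.
    static final String FILLER = "_";

    /** Returns a copy of counts without any n-gram that contains a filler token. */
    static Map<String, Integer> dropFillerNGrams(Map<String, Integer> counts) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            boolean hasFiller = false;
            // Compare whole tokens so that words merely containing '_' are kept.
            for (String token : entry.getKey().split(" ")) {
                if (token.equals(FILLER)) {
                    hasFiller = true;
                    break;
                }
            }
            if (!hasFiller) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A few entries copied from the output above.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("satellite", 1);
        counts.put("falling _", 1);
        counts.put("satellite _ _", 1);
        counts.put("_", 3);
        counts.put("falling", 1);
        counts.put("earth", 1);
        System.out.println(dropFillerNGrams(counts));
        // prints {satellite=1, falling=1, earth=1}
    }
}
```

Note that this only drops the entries with underscores. It does not join the remaining words, so "satellite falling" and "falling earth" still would not show up.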