Using shingles and stop words with Elasticsearch and Lucene 4.4

In the index I am building, I want to run a query and then (using facets) return the shingles of that query. Here is the analyzer I use on the text field:

{ "settings": { "analysis": { "analyzer": { "shingleAnalyzer": { "tokenizer": "standard", "filter": [ "standard", "lowercase", "custom_stop", "custom_shingle", "custom_stemmer" ] } }, "filter": { "custom_stemmer" : { "type": "stemmer", "name": "english" }, "custom_stop": { "type": "stop", "stopwords": "_english_" }, "custom_shingle": { "type": "shingle", "min_shingle_size": "2", "max_shingle_size": "3" } } } } } 

The main problem is that, as of Lucene 4.4, stop filters no longer support the enable_position_increments parameter for eliminating shingles that contain stop words. Instead, for a phrase like

red and yellow

 "terms": [ { "term": "red", "count": 43 }, { "term": "red _", "count": 43 }, { "term": "red _ yellow", "count": 43 }, { "term": "_ yellow", "count": 42 }, { "term": "yellow", "count": 42 } ] 

Naturally, this greatly skews the set of shingles returned. Is there a post-Lucene 4.4 way to handle this without post-processing the results?

+6
2 answers

This is probably not the best solution, but the simplest thing would be to add another filter to your analyzer that kills the "_" filler tokens. In the example below I called it "kill_fillers":

  "shingleAnalyzer": { "tokenizer": "standard", "filter": [ "standard", "lowercase", "custom_stop", "custom_shingle", "custom_stemmer", "kill_fillers" ], ... 

Then define "kill_fillers" in your list of filters:

 "filters":{ ... "kill_fillers": { "type": "pattern_replace", "pattern": ".*_.*", "replace": "", }, ... } 
+6

I'm not sure if this helps, but in the shingle filter definition you can use the filler_token parameter, which defaults to "_". Set it, for example, to an empty string:

 $indexParams['body']['settings']['analysis']['filter']['shingle-filter']['filler_token'] = ""; 

https://www.elastic.co/guide/en/elasticsearch/reference/1.7/analysis-shingle-tokenfilter.html
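For completeness, the same filler_token setting expressed directly in the index settings JSON (a sketch based on the custom_shingle filter from the question) would look something like:

    "filter": {
      "custom_shingle": {
        "type": "shingle",
        "min_shingle_size": "2",
        "max_shingle_size": "3",
        "filler_token": ""
      }
    }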

+2
