Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

Question


When I use an analyzer with an EdgeNGram filter (min = 3, max = 7, side = front) and term_vector = with_positions_offsets,

with a document having text = "CouchDB",

and I search for "couc",

my highlight is on "cou", not "couc".

It seems that the highlight covers only the shortest matching token, "cou", while I would expect it on the exact token (if possible) or at least on the longest matching token.

It works fine when the text is indexed without term_vector = with_positions_offsets.

What is the performance impact of removing term_vector = with_positions_offsets?
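For readers trying to reproduce this, here is a minimal sketch of the setup described in the question. The filter name `front_edge`, the analyzer name `edge_analyzer`, the type name `doc`, the field name `text`, the standard tokenizer and the lowercase filter are all assumptions, since the question does not give them; only the gram sizes, the side and the term_vector value come from the question. The helper `edge_ngrams` is a plain-Python imitation of what a front edge n-gram filter emits, not the Lucene implementation.

```python
def edge_ngrams(token, min_gram=3, max_gram=7):
    """Imitate a front edge n-gram filter on one lowercased token."""
    token = token.lower()
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# Hypothetical index body; every name except the gram sizes, the side
# and the term_vector value is an assumption made for illustration.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "front_edge": {
                    "type": "edgeNGram",
                    "min_gram": 3,
                    "max_gram": 7,
                    "side": "front",
                }
            },
            "analyzer": {
                "edge_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "front_edge"],
                }
            },
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "text": {
                    "type": "string",
                    "analyzer": "edge_analyzer",
                    "term_vector": "with_positions_offsets",
                }
            }
        }
    },
}

indexed_tokens = edge_ngrams("CouchDB")
query_tokens = edge_ngrams("couc")
print(indexed_tokens)
print(query_tokens)
```

Under these assumptions, both "cou" and "couc" are among the indexed tokens, and if the same analyzer is applied to the query, "couc" is itself split into "cou" and "couc", so the query matches on both tokens.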

+6

java search elasticsearch lucene n-gram

Sebastien lorber Jul 03 '12 at 2:19


2 answers

I know this question is old, but it has not yet been fully answered.

There is another option that can lead to this strange behavior:

You have to set require_field_match to true if you do not want matches on other fields to affect the highlighting of the current field, see http://www.elasticsearch.org/guide/reference/api/search/highlighting/
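As a rough illustration, a search body with this option could be built as follows. The field name `text`, the helper name `build_highlight_query` and the use of a `match` query are assumptions for the example; only `require_field_match` comes from the answer.

```python
import json


def build_highlight_query(field, text, require_field_match=True):
    """Build a search body that highlights `field` (hypothetical helper)."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            # Only highlight a field if the query actually matched on it.
            "require_field_match": require_field_match,
            "fields": {field: {}},
        },
    }


body = build_highlight_query("text", "couc")
print(json.dumps(body, indent=2))
```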

+4

David heidrich Apr 16 '13 at 10:45


javanna · Accepted Answer · 2013-03-29T14:25:01+0000

When you set term_vector=with_positions_offsets for a specific field, it means that you are storing term vectors per document for that field.

When it comes to highlighting, term vectors let you use the Lucene fast vector highlighter, which is faster than the standard highlighter. The reason is that the standard highlighter has no quick way to highlight, because the index does not contain enough information (positions and offsets). It can only re-analyze the field content, collect offsets and positions, and highlight based on that information. This can take quite some time, especially with long text fields.

With term vectors you have enough information and do not need to re-analyze the text. The downside is the index size, which will increase noticeably. I should add that since Lucene 4.2, term vectors are better compressed and stored in an optimized way. There is also the new PostingsHighlighter, based on the ability to store offsets in the postings list, which requires even less space.

Elasticsearch automatically uses the best way to highlight based on the information available. If term vectors are stored, it uses the fast vector highlighter, otherwise the standard one. After you reindex without term vectors, highlighting will be done with the standard highlighter. It will be slower, but the index will be smaller.
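The selection rule just described can be sketched in a few lines. This is only a toy model of the decision, not Elasticsearch code; the function name `pick_highlighter` and the returned labels are made up for illustration.

```python
def pick_highlighter(term_vector="no"):
    """Toy model: fast vector highlighter only if positions and offsets are stored."""
    if term_vector == "with_positions_offsets":
        return "fast vector highlighter"
    return "standard highlighter"


print(pick_highlighter("with_positions_offsets"))
print(pick_highlighter())
```

So dropping term_vector from the mapping and reindexing moves the field from the first branch to the second.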

Regarding ngram fields, the behavior described is odd, since the fast vector highlighter should have better support for ngram fields, so I would expect exactly the opposite result.
