Elasticsearch has a built-in “highlighting” feature that lets you mark the matching terms in the results (more complicated than it might sound at first, since the query syntax can include near matches, etc.).
I have HTML fields, and Elasticsearch stomps all over the HTML syntax when I turn on highlighting.
Can I make the highlighting HTML-aware / HTML-safe when used this way?
I would like the highlighting to apply to the text in the HTML document, and not to highlight any HTML markup that matches the search, i.e. a search for “p” could highlight <p>p</p> → <p><mark>p</mark></p>.
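To make the desired behaviour concrete, here is a minimal Python sketch of what I mean by HTML-aware highlighting. It is not Elasticsearch code: the class `TextOnlyHighlighter` and the function `highlight_text_only` are names I made up for illustration, and it only does naive whole-word matching. It also treats the contents of script and style elements as ordinary text, so it is a sketch rather than a complete solution.

```python
import re
from html.parser import HTMLParser


class TextOnlyHighlighter(HTMLParser):
    """Re-emit an HTML document, wrapping matches of `term` in text nodes only."""

    def __init__(self, term, pre_tag="<mark>", post_tag="</mark>"):
        # convert_charrefs=False keeps entities such as &amp; as they were written.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
        self.pre_tag = pre_tag
        self.post_tag = post_tag
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Markup is copied verbatim, so tag names and attributes are never highlighted.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Note: the parser lowercases tag names, so </P> comes back as </p>.
        self.out.append("</%s>" % tag)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_data(self, data):
        # Only text nodes are searched for the term.
        wrap = lambda m: self.pre_tag + m.group(0) + self.post_tag
        self.out.append(self.pattern.sub(wrap, data))


def highlight_text_only(html_doc, term):
    highlighter = TextOnlyHighlighter(term)
    highlighter.feed(html_doc)
    highlighter.close()
    return "".join(highlighter.out)


print(highlight_text_only("<p>p</p>", "p"))
```

The point of the sketch is only that the `<p>` tags survive untouched while the text node between them gets wrapped; that is the result I am hoping to get from the highlighter itself.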
My fields are indexed as "type: string".
Documentation:
encoder:
The encoder parameter can be used to define how highlighted text will be encoded. It can be either default (no encoding) or html (will escape HTML, if you use HTML highlighting tags).
... but that HTML-escapes my already encoded HTML field, breaking it even further.
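To illustrate why that breaks an HTML field, here is a rough Python sketch of how I understand the html encoder to behave: everything in the fragment is escaped, and the highlight tags are inserted around the matched offsets. This is an assumption about the behaviour, not Elasticsearch's actual implementation; the function name `html_encoder_highlight` is hypothetical, and Python's `html.escape` only stands in for whatever escaping Elasticsearch applies (the exact set of entities may differ).

```python
from html import escape


def html_encoder_highlight(text, start, end, pre_tag="<tag1>", post_tag="</tag1>"):
    """Escape the whole fragment, then wrap text[start:end] in the highlight tags."""
    return (
        escape(text[:start])
        + pre_tag
        + escape(text[start:end])
        + post_tag
        + escape(text[end:])
    )


# The stored field is already HTML, so its own markup gets escaped too,
# and the tag name "p" at offsets 1..2 is treated as a matching term.
stored = '<p class="text">TOP STORIES</p>'
print(html_encoder_highlight(stored, 1, 2))
```

Under these assumptions the field's own markup comes back as entities, with the highlight tags wrapped around a tag name, which is the opposite of what I want.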
Here are two sample queries:
- Using the default encoder:
The highlight tags are inserted inside other tags, i.e. <p> → <<tag1>p</tag1>>:
curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags" : ["<tag1>"],
    "post_tags" : ["</tag1>"],
    "encoder": "default",
    "fields": {
      "preview_html" : {}
    }
  },
  "from" : 22, "size" : 1
}'
GIVES:
...
"highlight" : {
"preview_html" : [ "<<tag1>p</tag1> class=\"text\">TOP STORIES</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Middle East</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Syria: Developments in Syria are main story in Middle East</<tag1>p</tag1>>" ]
}
...
- Using the html encoder:
The existing HTML syntax is escaped by Elasticsearch, which breaks things, i.e. <p> → &lt;<tag1>p</tag1>&gt;:
curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags" : ["<tag1>"],
    "post_tags" : ["</tag1>"],
    "encoder": "html",
    "fields": {
      "preview_html" : {}
    }
  },
  "from" : 22, "size" : 1
}'
GIVES:
...
"highlight" : {
"preview_html" : [ "&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;TOP STORIES&lt;/<tag1>p</tag1>&gt;&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;Middle East&lt;/<tag1>p</tag1>&gt;&lt;<tag1>p</tag1> class=&quot;text&quot;&gt;Syria: Developments in Syria are main story in Middle East&lt;/<tag1>p</tag1>&gt;" ]
}
}
...