Highlighter use in Lucene

I have two questions regarding the Highlighter in Apache Lucene:

  • See this function: could you please explain the use of the TokenStream parameter?

  • I have several large Lucene documents containing many fields, and each field has several lines. I have found the most relevant document for a specific query. This document was found because several words in the query matched words in the document, and I want to know which query words caused the match. For this I plan to use the Lucene Highlighter. Example: if the query is “skin doctor delhi”, and the document titled “dermatologist” contains the words “skin” and “doctor”, then after highlighting I should be able to extract “skin” and “doctor” from the query. I have been trying to write code for this for several weeks, but I cannot get what I want. Could you help me?

Thanks in advance.

Update:

Current approach: I am creating a query containing all the words in the document.

    Field[] field = doc.getFields("description");
    String desc = "";
    for (int j = 0; j < field.length; ++j) {
        desc += field[j].stringValue() + " ";
    }

    Query q = qp.parse(desc);
    QueryScorer scorer = new QueryScorer(q, reader, "description");
    Highlighter highlighter = new Highlighter(scorer);

    String fragment = highlighter.getBestFragment(analyzer, "description", text);

It works for small documents, but not for large ones. For large documents I get the following stack trace:

    org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:152)
    at org.apache.lucene.queryParser.QueryParser.getBooleanQuery(QueryParser.java:891)
    at org.apache.lucene.queryParser.QueryParser.getBooleanQuery(QueryParser.java:866)
    at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1213)
    at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1167)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:182)

Obviously, this approach does not scale to large documents. What needs to be done to fix this?

BTW I am using FuzzyQuery matching.

1 answer

EDIT: added some details about explain().

Some general information: the Lucene Highlighter is designed to find text fragments in a hit document and to highlight the tokens that match the query.

  • Therefore, the TokenStream parameter is used to split the hit text into tokens. The highlighter's scorer then scores each token, in order to score fragments and to select the fragments and tokens to highlight.
  • To see which query terms matched the document, you can use explain():

    Explanation expl = searcher.explain(query, docId);
    String asText = expl.toString();
    String asHtml = expl.toHtml();

docId is the Lucene id of the hit document.
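As a rough illustration of the underlying idea (splitting text into tokens and checking which query terms occur among them), here is a minimal plain-Java sketch. It does not use the Lucene API: the class `MatchedTerms` and its methods `tokenize` and `matchedTerms` are hypothetical names invented for this example, and the regex-based tokenizer is only a crude stand-in for a real Analyzer/TokenStream.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class MatchedTerms {

    // Lowercases the text and splits it on anything that is not a letter or digit.
    // Assumption: this simple rule stands in for a real Lucene Analyzer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns the distinct query terms that occur in the document text, in query order.
    static List<String> matchedTerms(String query, String docText) {
        Set<String> docTokens = new HashSet<>(tokenize(docText));
        List<String> matched = new ArrayList<>();
        for (String term : tokenize(query)) {
            if (docTokens.contains(term) && !matched.contains(term)) {
                matched.add(term);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String doc = "A dermatologist is a skin doctor.";
        System.out.println(matchedTerms("skin doctor delhi", doc)); // prints [skin, doctor]
    }
}
```

This only finds exact matches after lowercasing. It does not cover fuzzy matching, stemming, or stop words, which a real analyzer and query would handle.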

For the highlighter itself, see the Lucene 2.4.1 API, in particular the "QueryScorer" and "SpanScorer" classes.


Source: https://habr.com/ru/post/1736125/

