Lucene highlighter

How does the Lucene 4.3.1 highlighter work? I want to print each search result as the matched word plus the 8 words that follow it in the document. How can I use the Highlighter class to do this? I have added complete txt, html and xml documents to my index, and I now have the search code below, to which I would like to add highlighting:

    String index = "index";
    String field = "contents";
    String queries = null;
    int repeat = 1;
    boolean raw = true; // not sure what raw really does???
    String queryString = null; // keep null, prompt user later for it
    int hitsPerPage = 10; // leave it at 10, go from there later

    // need to add all files to same directory
    index = "C:\\Users\\plib\\Documents\\index";
    repeat = 4;

    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);

    BufferedReader in = null;
    if (queries != null) {
        in = new BufferedReader(new InputStreamReader(new FileInputStream(queries), "UTF-8"));
    } else {
        in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
    }
    QueryParser parser = new QueryParser(Version.LUCENE_43, field, analyzer);
    while (true) {
        if (queries == null && queryString == null) {
            // prompt the user
            System.out.println("Enter query. 'quit' = quit: ");
        }

        String line = queryString != null ? queryString : in.readLine();

        if (line == null || line.length() == -1) {
            break;
        }

        line = line.trim();
        if (line.length() == 0 || line.equalsIgnoreCase("quit")) {
            break;
        }

        Query query = parser.parse(line);
        System.out.println("Searching for: " + query.toString(field));

        if (repeat > 0) { // repeat & time as benchmark
            Date start = new Date();
            for (int i = 0; i < repeat; i++) {
                searcher.search(query, null, 100);
            }
            Date end = new Date();
            System.out.println("Time: " + (end.getTime() - start.getTime()) + "ms");
        }

        doPagingSearch(in, searcher, query, hitsPerPage, raw, queries == null && queryString == null);

        if (queryString != null) {
            break;
        }
    }
    reader.close();
}

2 answers

For the Lucene highlighter to work, add two fields to the document you are indexing: one field with term vectors enabled and another field without term vectors. For simplicity, here is a code snippet:

    FieldType type = new FieldType();
    type.setIndexed(true);
    type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    type.setStored(true);
    type.setStoreTermVectors(true);
    type.setTokenized(true);
    type.setStoreTermVectorOffsets(true);

    // This field has term vectors enabled.
    Field field = new Field("content", "This is fragment. Highlters", type);
    doc.add(field);

    // This field has no term vectors.
    doc.add(new StringField("ncontent", "This is fragment. Highlters", Field.Store.YES));

After setting up both fields, add the document to your index. Then, to use the Lucene highlighter, use the method below (it was written for Lucene 4.2; I have not tested it with Lucene 4.3.1):

    public void highLighter() throws IOException, ParseException, InvalidTokenOffsetsException {
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("INDEXDIRECTORY")));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser(Version.LUCENE_42, "content", analyzer);
        Query query = parser.parse("Highlters"); // your search keyword
        TopDocs hits = searcher.search(query, reader.maxDoc());
        System.out.println(hits.totalHits);

        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));

        // Iterate over the hits actually returned, not over every document in the index.
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            int id = hits.scoreDocs[i].doc;
            Document doc = searcher.doc(id);

            // Field without term vectors
            String text = doc.get("ncontent");
            TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "ncontent", analyzer);
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    System.out.println(frag[j].toString());
                }
            }

            // Field with term vectors
            text = doc.get("content");
            tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "content", analyzer);
            frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    System.out.println(frag[j].toString());
                }
            }
            System.out.println("-------------");
        }
        reader.close();
    }
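
The question asks for the matched word plus the 8 words after it, whereas Lucene's fragmenters (for example SimpleFragmenter) are sized in characters rather than words. One possible workaround, sketched below in plain Java, is to post-process the stored text returned by doc.get(...) yourself. The class SnippetWindow and its parameters are hypothetical names made up for this illustration and are not part of Lucene; the sketch splits on whitespace and ignores the analyzer, so it will not match stemmed or otherwise rewritten terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Lucene: keyword-plus-following-words snippets. */
class SnippetWindow {

    /**
     * Returns one snippet per occurrence of keyword in text: the matching word
     * followed by up to wordsAfter subsequent words.
     */
    static List<String> snippets(String text, String keyword, int wordsAfter) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        String needle = normalize(keyword);
        if (trimmed.isEmpty() || needle.isEmpty()) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (normalize(words[i]).equals(needle)) {
                // Window covers the match itself plus the following words.
                int end = Math.min(words.length, i + 1 + wordsAfter);
                result.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
            }
        }
        return result;
    }

    /** Lower-cases a word and strips leading and trailing punctuation. */
    private static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT)
                   .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }
}
```

Inside the hit loop you could then call something like SnippetWindow.snippets(doc.get("content"), "Highlters", 8) and print each returned string.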

I had the same question and eventually came across this post:

http://vnarcher.blogspot.ca/2012/04/highlighting-text-with-lucene.html

The key part is that, while iterating over your results, you call getHighlightedField on the field value you want to highlight.

    private String getHighlightedField(Query query, Analyzer analyzer, String fieldName, String fieldValue)
            throws IOException, InvalidTokenOffsetsException {
        // Wrap every matched term in a span that carries the MatchedText CSS class.
        Formatter formatter = new SimpleHTMLFormatter("<span class=\"MatchedText\">", "</span>");
        QueryScorer queryScorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(formatter, queryScorer);
        // Treat the whole field value as a single fragment and analyze all of it.
        highlighter.setTextFragmenter(new SimpleSpanFragmenter(queryScorer, Integer.MAX_VALUE));
        highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
        return highlighter.getBestFragment(analyzer, fieldName, fieldValue);
    }

This assumes the output will be HTML: it simply wraps each highlighted term in a <span> with the CSS class MatchedText. You can then define a custom CSS rule to style the highlighted text however you like.
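
To get a feel for the shape of the returned string without an index at hand, here is a plain-Java stand-in that wraps whole-word matches in the same span. It is only an illustration: SpanWrapSketch is a hypothetical name, it uses a regular expression instead of Lucene's analyzer and query scoring, and it returns the text unchanged when nothing matches, whereas Highlighter.getBestFragment returns null in that case.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in that mimics the shape of the highlighted output; not Lucene code. */
class SpanWrapSketch {

    /** Wraps each case-insensitive whole-word occurrence of term in a MatchedText span. */
    static String wrap(String text, String term) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        // $0 re-inserts the matched text, so the original casing is preserved.
        return pattern.matcher(text)
                      .replaceAll("<span class=\"MatchedText\">$0</span>");
    }
}
```

Note that neither this sketch nor the method above HTML-escapes the field text itself, so escape it separately if it can contain markup.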


Source: https://habr.com/ru/post/1490369/
