How to analyze and index different parts of an HTML page using Tika & Lucene?

I am trying to parse and index different parts of an HTML page using Lucene and Tika. For example, I would like to index the text in the title, h1, h2, and a tags of the page separately and apply different enhancements to each of them. I use Tika to parse the HTML and create a Document object with the appropriate fields to be indexed. However, I could not find anything in Tika that would let me index the tags I want out of the box.

My code looks something like this:

 InputStream is = new FileInputStream(f);
 Parser parser = new AutoDetectParser();
 ContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
 Metadata metadata = new Metadata(); // filled in by the parser
 ParseContext context = new ParseContext();
 context.set(HtmlMapper.class, DefaultHtmlMapper.INSTANCE);

 try {
  parser.parse(is, handler, metadata, context);
 } finally {
  is.close();
 }

 Document doc = new Document();
 doc.add(new Field("contents", handler.toString(),
   Field.Store.NO, Field.Index.ANALYZED));

 for (String name : metadata.names()) {
  String value = metadata.get(name);

  if (textualMetadataFields.contains(name)) {
   doc.add(new Field("contents", value,
     Field.Store.NO, Field.Index.ANALYZED));
  }

  // stored for retrieval only; use Field.Index.ANALYZED to make it searchable
  doc.add(new Field(name, value, Field.Store.YES, Field.Index.NO));
 }

Stepping through the Tika HTML parsing code, I found that the org.apache.tika.parser.html.HtmlHandler class is what populates the metadata object.

Should I write my own HTML parser, or my own HtmlHandler? Is there something in Tika for working with HTML that I have missed?
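
One direction that may be worth trying, given here only as a sketch: Tika parsers report their output as XHTML SAX events to whatever ContentHandler is passed to parse(), so a small custom handler could collect the text of the tags of interest. The class name TagTextHandler and its getTexts() method below are hypothetical, not part of Tika or Lucene. The sketch uses only the JDK's SAX classes and feeds itself a well-formed XHTML string so it can run without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects the text of selected elements, keyed by tag name. */
public class TagTextHandler extends DefaultHandler {
    private final Set<String> wanted;
    private final Map<String, List<String>> texts = new LinkedHashMap<>();
    // One buffer per currently open element of interest, so nesting
    // such as <h1><a>..</a></h1> is handled.
    private final Deque<StringBuilder> open = new ArrayDeque<>();

    public TagTextHandler(String... tags) {
        wanted = new HashSet<>(Arrays.asList(tags));
    }

    // Prefer the namespace-aware local name, fall back to the qualified name.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (wanted.contains(name(localName, qName))) {
            open.push(new StringBuilder());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text belongs to every open element of interest that encloses it.
        for (StringBuilder sb : open) {
            sb.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        if (wanted.contains(n) && !open.isEmpty()) {
            String text = open.pop().toString().trim();
            if (!text.isEmpty()) {
                texts.computeIfAbsent(n, k -> new ArrayList<>()).add(text);
            }
        }
    }

    public Map<String, List<String>> getTexts() {
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>Demo page</title></head><body>"
            + "<h1>Main <a href=\"/x\">link</a></h1><h2>Sub</h2><p>ignored</p></body></html>";
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        TagTextHandler handler = new TagTextHandler("title", "h1", "h2", "a");
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        // prints {title=[Demo page], a=[link], h1=[Main link], h2=[Sub]}
        System.out.println(handler.getTexts());
    }
}
```

With Tika, an instance of such a handler could presumably be passed to parser.parse(is, handler, metadata, context) in place of the BodyContentHandler, or alongside it through org.apache.tika.sax.TeeContentHandler, and each collected text added to the Lucene Document under its own field name. Which tags actually reach the handler depends on the HtmlMapper in use, so that part would need checking against the Tika version at hand.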


Source: https://habr.com/ru/post/1784523/