I am trying to parse and index different parts of an HTML page using Lucene and Tika. For example, I would like to index the text in the title, h1, h2, and a tags of the page separately and apply a different boost to each of them. I use Tika to parse the HTML and build a Lucene Document object with the appropriate fields to be indexed. However, I could not find anything in Tika that lets me index the tags I want out of the box.
My code looks something like this:
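    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.html.DefaultHtmlMapper;
    import org.apache.tika.parser.html.HtmlMapper;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // f is the HTML file; textualMetadataFields is a set of metadata names
    InputStream is = new FileInputStream(f);
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, DefaultHtmlMapper.INSTANCE);
    try {
        parser.parse(is, handler, metadata, context);
    } finally {
        is.close();
    }
    Document doc = new Document();
    doc.add(new Field("contents", handler.toString(),
            Field.Store.NO, Field.Index.ANALYZED));
    for (String name : metadata.names()) {
        String value = metadata.get(name);
        // Textual metadata is also made searchable through "contents"
        if (textualMetadataFields.contains(name)) {
            doc.add(new Field("contents", value,
                    Field.Store.NO, Field.Index.ANALYZED));
        }
        // Field.Index has no YES constant; NOT_ANALYZED indexes the value as one term
        doc.add(new Field(name, value, Field.Store.YES, Field.Index.NOT_ANALYZED));
    }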
Stepping through the Tika HTML parsing code, I found that the org.apache.tika.parser.html.HtmlHandler class is what populates the metadata object.
Should I write my own HTML parser with a custom HtmlHandler?
Or is there something in Tika's HTML support that I have missed?
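To make the goal concrete, here is a minimal sketch of the kind of per-tag capture I have in mind. The class names TagTextHandler and TagTextDemo are hypothetical, and the demo uses only the JDK's SAX parser on a well-formed XHTML string as a stand-in for Tika. The assumption is that, because Tika reports parsed content as XHTML SAX events to whatever ContentHandler is passed to parser.parse(...), a handler like this could be used in place of BodyContentHandler. I have not verified that against Tika, and which elements actually reach the handler would depend on the HtmlMapper in use.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of selected elements, keyed by tag name. */
class TagTextHandler extends DefaultHandler {
    private final Set<String> wanted = new HashSet<>();
    private final Map<String, List<String>> textByTag = new LinkedHashMap<>();
    // One buffer per currently open wanted element, so nesting (<a> inside <h1>) works
    private final Deque<StringBuilder> open = new ArrayDeque<>();

    TagTextHandler(String... tags) {
        for (String t : tags) {
            wanted.add(t.toLowerCase(Locale.ROOT));
        }
    }

    // Namespace-aware parsers fill localName; others only fill qName
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (wanted.contains(name(localName, qName))) {
            open.push(new StringBuilder());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text belongs to every wanted element that is currently open
        for (StringBuilder sb : open) {
            sb.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String tag = name(localName, qName);
        if (wanted.contains(tag) && !open.isEmpty()) {
            String text = open.pop().toString().trim();
            textByTag.computeIfAbsent(tag, k -> new ArrayList<>()).add(text);
        }
    }

    Map<String, List<String>> getTextByTag() {
        return textByTag;
    }
}

class TagTextDemo {
    // Parses a well-formed XHTML string with the JDK SAX parser (stand-in for Tika)
    static Map<String, List<String>> extract(String xhtml) throws Exception {
        TagTextHandler handler = new TagTextHandler("title", "h1", "h2", "a");
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getTextByTag();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>Demo</title></head><body>"
                + "<h1>Main <a href=\"x\">link</a></h1><h2>Sub</h2></body></html>";
        System.out.println(extract(xhtml));
    }
}
```

Each list in the resulting map could then be added to the Lucene Document under its own field name (for instance "h1"), so that every tag can be boosted separately. Text inside a nested tag is deliberately counted for both the inner and the outer element.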