Parsing HTML data with Nutch 1.0 and a custom plugin

I'm currently trying to write my own plugin for Nutch 1.0. The plugin should parse HTML data and extract the relevant information from documents. I have the basic plugin working: it extends the HtmlParserResult object and is executed every time I run the parsing step.

I currently have two problems:

  • I don’t understand how the Nutch parsing workflow / pipeline works, and I can't find any information about it on the Nutch website.

  • I don’t understand how DOM parsing is supposed to be done. I can see that Nutch has a lot of DOM objects and that the HtmlParser plugin does some DOM parsing, but I still haven’t figured out the best way to do it myself.

1 answer

I remember writing a plugin for parsing HTML files at a previous job. I no longer have access to that code, but here are the main points. We wanted to do the following:

  • parse an HTML page, but conditionally use the H1 tag, or a tag with a specific class, as the page title instead of the actual //html/head/title
  • pick up some special pieces of data that were sometimes on the page (for example, which tab was selected, which told us whether it was a retail client, a bank client or a corporate client)
  • and so on.
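
The first two goals above can be sketched with plain org.w3c.dom calls. This is only a minimal illustration, not the original plugin: the class PageFields, its method names, the class names "page-title" and "selected", and the lookup order (special class, then h1, then title) are all assumptions made for the example. It parses a well-formed sample string with the JDK's XML parser as a stand-in for whatever DOM tree the Nutch parser hands to a plugin.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of Nutch. */
final class PageFields {

    /** Parses well-formed XHTML into a DOM (stand-in for the tree a plugin would receive). */
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    /** Depth-first search for the first element matching a tag and/or class (null = don't care). */
    static Element findFirst(Node node, String tag, String cssClass) {
        if (node instanceof Element) {
            Element el = (Element) node;
            boolean tagOk = tag == null || el.getTagName().equalsIgnoreCase(tag);
            boolean classOk = cssClass == null || hasClass(el, cssClass);
            if (tagOk && classOk) {
                return el;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            Element hit = findFirst(c, tag, cssClass);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** True if the element's class attribute contains the given whitespace-separated token. */
    static boolean hasClass(Element el, String cssClass) {
        for (String token : el.getAttribute("class").trim().split("\\s+")) {
            if (token.equals(cssClass)) {
                return true;
            }
        }
        return false;
    }

    /** Assumed order: element with the special class, then first h1, then head/title. */
    static String pickTitle(Node root, String titleClass) {
        Element el = findFirst(root, null, titleClass);
        if (el == null) {
            el = findFirst(root, "h1", null);
        }
        if (el == null) {
            el = findFirst(root, "title", null);
        }
        return el == null ? "" : el.getTextContent().trim();
    }

    /** Text of the first element carrying the "selected" marker class, or null if none. */
    static String selectedTab(Node root, String selectedClass) {
        Element el = findFirst(root, null, selectedClass);
        return el == null ? null : el.getTextContent().trim();
    }

    public static void main(String[] args) {
        Document doc = parseXhtml(
              "<html><head><title>Generic title</title></head><body>"
            + "<ul><li class=\"tab\">Retail</li>"
            + "<li class=\"tab selected\">Corporate</li></ul>"
            + "<h1>Account overview</h1></body></html>");
        System.out.println(pickTitle(doc, "page-title")); // no such class here, so the h1 text is used
        System.out.println(selectedTab(doc, "selected"));  // text of the selected tab
    }
}
```

Inside a real plugin the same tree walk would run on the DOM node Nutch passes in rather than on a freshly parsed string, and the extracted values would be stored in the parse metadata.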

[The remainder of this answer was lost in extraction. The surviving fragments mention the html-parse plugin, calling super, and working with the DOM, and link to the Nutch sources at http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.nutch/nutch/1.3/]


Source: https://habr.com/ru/post/1707291/

