I once wrote a plugin for parsing HTML files at a previous job. I no longer have access to that code, but here are the main points. We wanted to do the following:
- Parse an HTML page, but conditionally use the H1 tag or a tag with a specific class as the page title, instead of the actual //html/head/title.
- Extract some special pieces of data that were sometimes on the page (for example, which tab was selected, which told us whether it was a retail client, a bank client or a corporate client).
- And so on.
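To make the first two points concrete, here is a rough sketch of that kind of DOM lookup, not the original code. It uses only the standard `org.w3c.dom` API, so it runs without Nutch; as far as I recall, a Nutch 1.x `HtmlParseFilter` receives the page as a `DocumentFragment`, which is also a `Node`, so the same helpers should apply there. The class names `page-title` and `tab-selected`, the lookup order (class, then H1, then title) and the `PageFields` class itself are assumptions made up for illustration.

```java
import java.io.StringReader;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class PageFields {

    // Hypothetical class names; the real markup is not known.
    static final String TITLE_CLASS = "page-title";
    static final String SELECTED_TAB_CLASS = "tab-selected";

    /** Depth-first search for the first element accepted by the matcher. */
    static Element findFirst(Node node, Predicate<Element> matcher) {
        if (node instanceof Element && matcher.test((Element) node)) {
            return (Element) node;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            Element hit = findFirst(child, matcher);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** True if the element's whitespace-separated class attribute contains cls. */
    static boolean hasClass(Element e, String cls) {
        for (String token : e.getAttribute("class").trim().split("\\s+")) {
            if (token.equals(cls)) {
                return true;
            }
        }
        return false;
    }

    /** Assumed order: element with the title class, then H1, then the real title. */
    static String pickTitle(Node root) {
        Element e = findFirst(root, el -> hasClass(el, TITLE_CLASS));
        if (e == null) {
            e = findFirst(root, el -> el.getTagName().equalsIgnoreCase("h1"));
        }
        if (e == null) {
            e = findFirst(root, el -> el.getTagName().equalsIgnoreCase("title"));
        }
        return e == null ? "" : e.getTextContent().trim();
    }

    /** Text of the selected tab, or null when the page has no such tab. */
    static String clientType(Node root) {
        Element tab = findFirst(root, el -> hasClass(el, SELECTED_TAB_CLASS));
        return tab == null ? null : tab.getTextContent().trim();
    }

    /** Demo helper only: parses well-formed XHTML with the JDK XML parser. */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        Node doc = parse("<html><head><title>Bank</title></head><body>"
                + "<ul><li class=\"tab\">Retail</li>"
                + "<li class=\"tab tab-selected\">Corporate</li></ul>"
                + "<h1>Corporate accounts</h1></body></html>");
        System.out.println(pickTitle(doc));   // Corporate accounts
        System.out.println(clientType(doc));  // Corporate
    }
}
```

Tag names are compared case-insensitively because HTML parsers differ in whether they report `H1` or `h1`. Returning null from `clientType` keeps "no tab on this page" distinct from an empty tab label.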
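As far as I remember, we built this on top of the html-parse plugin: our code called super to get the DOM, did its own processing on it, and otherwise left the work to super.
How do you work with the DOM? Look at the nutch source code (http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.nutch/nutch/1.3/).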