Extract and clean HTML snippet using HTML Parser (org.htmlparser)

Question

I am looking for an effective approach to extracting an HTML fragment from a web page and performing certain operations on this HTML fragment.

Required Operations:

Remove all tags that have the class "hidden".
Remove all script tags.
Remove all style tags.
Remove all event attributes (on*="*").
Remove all style attributes.

I used HTML Parser (org.htmlparser) for this task and was able to meet all the requirements; however, I do not feel that I have an elegant solution. Currently I parse the web page with a CssSelectorNodeFilter (to get the fragment), and then re-parse that fragment with a NodeVisitor to perform the cleanup operations.

Can anyone suggest how they would tackle this problem? I would prefer to parse the document only once and perform all operations during that single parse.

Thanks in advance!

java html-parsing software-design

Kieran hall Dec 02 '11 at 14:30

1 answer

maerics · Accepted Answer · 2011-12-02T15:16:05+0000

Check out jsoup - it should handle all your necessary tasks in an elegant way.

Edit: here is a complete working example covering each required operation:

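    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Attribute;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Load and parse the document fragment.
    File f = new File("myfile.html"); // See also Jsoup#parseBodyFragment(s)
    Document doc = Jsoup.parse(f, "UTF-8", "http://example.com");

    // Remove all script and style elements and those of class "hidden".
    doc.select("script, style, .hidden").remove();

    // Remove all style and event-handler attributes from all elements.
    Elements all = doc.select("*");
    for (Element el : all) {
      // Collect the keys first: removing attributes while iterating over
      // el.attributes() is not safe in every jsoup version.
      List<String> toRemove = new ArrayList<>();
      for (Attribute attr : el.attributes()) {
        String attrKey = attr.getKey();
        if (attrKey.equals("style") || attrKey.startsWith("on")) {
          toRemove.add(attrKey);
        }
      }
      for (String attrKey : toRemove) {
        el.removeAttr(attrKey);
      }
    }
    // See also - doc.select("*").removeAttr("style");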

You will want to make sure that details such as case sensitivity do not matter for attribute names, but this should cover most of what you need.
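The case-sensitivity point can be handled by normalising each attribute name before testing it. The following is only a minimal sketch with no jsoup dependency. The class and method names (`AttributeFilter`, `shouldStrip`, `keysToStrip`) are hypothetical and exist purely for illustration; the attribute names would come from `attr.getKey()` in the loop above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of jsoup: decides which attribute
// names the cleanup loop should remove.
class AttributeFilter {

    // True for "style" and for names starting with "on" (event handlers
    // such as "onclick"), however the name is capitalised in the markup.
    static boolean shouldStrip(String attrName) {
        // Locale.ROOT keeps lower-casing independent of the default locale.
        String key = attrName.toLowerCase(Locale.ROOT);
        return key.equals("style") || key.startsWith("on");
    }

    // Returns the names that should be removed, in their original order
    // and with their original spelling, so they can be passed to removeAttr.
    static List<String> keysToStrip(List<String> attrNames) {
        List<String> result = new ArrayList<>();
        for (String name : attrNames) {
            if (shouldStrip(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Note that the prefix test is deliberately broad: any attribute whose name begins with "on" is treated as an event handler, which matches the behaviour of the `startsWith("on")` check in the answer.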
