Extract and clean HTML snippet using HTML Parser (org.htmlparser)

I am looking for an effective approach to extracting an HTML fragment from a web page and performing certain operations on this HTML fragment.

Required Operations:

  • Remove all elements that have the class "hidden"
  • Remove all script tags
  • Remove all style tags
  • Remove all event-handler attributes (on*="*")
  • Remove all style attributes

I used HTML Parser (org.htmlparser) for this task and was able to meet all the requirements; however, I do not feel I have an elegant solution. Currently I parse the web page with a CssSelectorNodeFilter (to get the fragment) and then re-parse that fragment with a NodeVisitor to perform the cleanup operations.

Can anyone suggest how they would tackle this problem? I would prefer to parse the document only once and perform all operations during that single parse.

Thanks in advance!

1 answer

Check out jsoup - it should handle all your necessary tasks in an elegant way.

[Edit]

Here is a complete working example for each required operation:

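    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Attribute;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Load and parse the document (Jsoup.parse(File, ...) throws IOException).
    // See also Jsoup#parseBodyFragment(String) for parsing a fragment.
    File f = new File("myfile.html");
    Document doc = Jsoup.parse(f, "UTF-8", "http://example.com");

    // Remove all script and style elements and those of class "hidden".
    doc.select("script, style, .hidden").remove();

    // Remove all style and event-handler attributes from all elements.
    Elements all = doc.select("*");
    for (Element el : all) {
        // Iterate over a copy (asList()) so that removing attributes
        // does not modify the collection being iterated.
        for (Attribute attr : el.attributes().asList()) {
            String attrKey = attr.getKey();
            if (attrKey.equals("style") || attrKey.startsWith("on")) {
                el.removeAttr(attrKey);
            }
        }
    }
    // See also: doc.select("*").removeAttr("style");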

You will want to handle details such as case-insensitive matching of attribute names, but this should be most of what you need.
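
One way to address the case-sensitivity caveat is to normalize the attribute name before matching it. The following is only a minimal sketch in plain Java; the class `AttributeFilter` and the method `isUnwantedAttribute` are hypothetical names introduced for illustration and are not part of jsoup.

```java
import java.util.Locale;

// Hypothetical helper (not part of jsoup): decides whether an attribute
// should be stripped, ignoring the case of the attribute name.
final class AttributeFilter {

    private AttributeFilter() {}

    static boolean isUnwantedAttribute(String name) {
        if (name == null) {
            return false;
        }
        // Locale.ROOT keeps the lowercasing independent of the default locale.
        String key = name.trim().toLowerCase(Locale.ROOT);
        // "style" attributes and anything that starts with "on" (onclick, onload, ...).
        return key.equals("style") || key.startsWith("on");
    }

    public static void main(String[] args) {
        String[] samples = {"onclick", "OnLoad", "STYLE", "href", "class"};
        for (String name : samples) {
            System.out.println(name + " -> " + isUnwantedAttribute(name));
        }
    }
}
```

In the loop above, the condition could then become `if (AttributeFilter.isUnwantedAttribute(attr.getKey()))`. Note that the "on" prefix test is deliberately broad: it would also match any non-event attribute whose name happens to start with "on", so a stricter list of event names may be preferable if such attributes occur in your pages.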

Source: https://habr.com/ru/post/902934/

