How to convert webpage HTML source to org.w3c.dom.Document in Java?

How can I convert a web page's HTML source to an org.w3c.dom.Document in Java?

+2
3 answers

This is actually fairly hard to do, because arbitrary HTML pages are often malformed (mainstream browsers are quite tolerant of that). You could look at the Swing HTML parser, which I have never tried, but it looks like the best option. You can also try something along these lines and handle any parsing exceptions that occur (although I have only tried this with XML):

    import java.io.IOException;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import org.w3c.dom.Document;
    import org.xml.sax.SAXException;
    ...
    try {
        DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        // The argument is a placeholder for the InputStream of the HTTP response body
        Document doc = docBuilder.parse(InputStreamYouBuiltEarlierFromAnHTTPRequest);
    } catch (ParserConfigurationException e) {
        ...
    } catch (SAXException e) {
        // Thrown when the input is not well-formed XML, which is common for real-world HTML
        ...
    } catch (IOException e) {
        ...
    }
    ...
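
To make the limitation of this approach concrete, here is a minimal self-contained sketch that uses only the JDK. The class name `StrictHtmlToDom`, the method `parseOrNull`, and the two sample strings are made up for illustration and are not part of any library. It shows that the XML parser accepts well-formed XHTML-style markup but rejects typical "tag soup" such as an unclosed `<br>`:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for this example only.
final class StrictHtmlToDom {

    /**
     * Parses the markup with the JDK's XML parser.
     * Returns null when the markup is not well-formed XML.
     */
    static Document parseOrNull(String markup)
            throws ParserConfigurationException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputStream in = new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8));
        try {
            return builder.parse(in);
        } catch (SAXException e) {
            // Malformed input ends up here (the parser may also log the error to stderr)
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed: every element is closed, so a Document is produced
        Document ok = parseOrNull("<html><body><p>Hello<br/>world</p></body></html>");
        System.out.println(ok.getDocumentElement().getTagName()); // prints "html"

        // Browser-tolerated HTML: <br> and <p> are never closed, so parsing fails
        Document bad = parseOrNull("<html><body><p>Hello<br>world</body></html>");
        System.out.println(bad == null); // prints "true"
    }
}
```

So this route is only practical when the page is known to be valid XHTML; for arbitrary pages a lenient HTML parser is needed.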
+1

I suggest the Validator.nu HTML Parser (http://about.validator.nu/htmlparser/), which implements the HTML5 parsing algorithm. Firefox is in the process of replacing its own HTML parser with it.

+2

I have just been playing with JSoup, a fantastic Java HTML parser whose API looks a bit like jQuery. It is really easy to use.

+2

Source: https://habr.com/ru/post/983774/

