How to convert webpage HTML source to org.w3c.dom.Document in Java?

Question

How can I convert a web page's HTML source to an org.w3c.dom.Document in Java?

+2

Yatendra goel Feb 19 '10 at 16:34

3 answers

I suggest the Validator.nu HTML Parser (http://about.validator.nu/htmlparser/), which implements the HTML5 parsing algorithm. Firefox is in the process of replacing its own HTML parser with it.

+2

Ms2ger Feb 19 '10 at 18:13

I just played with jsoup, a fantastic Java HTML parser whose API looks a bit like jQuery. It is really easy to use.
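For the original question, a minimal sketch of going from jsoup to org.w3c.dom.Document might look like the following. It assumes the jsoup library is on the classpath and uses its `W3CDom` helper; the class name `JsoupToW3c` and the method `toW3c` are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

class JsoupToW3c {

    /** Parses possibly malformed HTML with jsoup, then converts the result to a W3C DOM. */
    static org.w3c.dom.Document toW3c(String html) {
        // jsoup is lenient: it repairs missing or unclosed tags while parsing.
        org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
        // W3CDom copies the jsoup tree into an org.w3c.dom.Document.
        return new W3CDom().fromJsoup(jsoupDoc);
    }

    public static void main(String[] args) {
        // Deliberately sloppy input: no <html>/<body>, unclosed <p> and <br>.
        org.w3c.dom.Document doc = toW3c("<p>Hello<br>world");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

To parse a live page rather than a string, `Jsoup.connect(url).get()` returns the same kind of jsoup document, which can be passed to `fromJsoup` in the same way.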

+2

DisgruntledGoat Feb 21 '10 at 23:58

Seth · Accepted Answer · 2010-02-19T17:10:26+0000

This is actually quite difficult to do, because arbitrary HTML pages are sometimes malformed (mainstream browsers are pretty tolerant of this). You could look into the Swing HTML parser, which I have never tried, but it seems to be the best option. You could also try something along these lines and handle any parsing exceptions that occur (although I have only tried this with XML):

    import java.io.File;
    import java.io.IOException;
    import org.w3c.dom.Document;
    import org.w3c.dom.*;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.ParserConfigurationException;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;
    ...
    try {
        DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        // Throws SAXException if the input is not well-formed XML
        Document doc = docBuilder.parse(InputStreamYouBuiltEarlierFromAnHTTPRequest);
    } catch (ParserConfigurationException e) {
        ...
    } catch (SAXException e) {
        ...
    } catch (IOException e) {
        ...
    }
    ...
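As a self-contained illustration of the approach above, the following sketch feeds two hard-coded strings to the JDK's XML parser instead of an HTTP stream. The class name `StrictHtmlToDom`, the method `parseOrNull`, and the sample markup are invented for this example. It is meant to show the limitation mentioned in the answer: well-formed XHTML-style markup parses, while typical "tag soup" is rejected with a SAXException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class StrictHtmlToDom {

    /** Parses markup with the JDK XML parser; returns null if parsing fails. */
    static Document parseOrNull(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            InputStream in = new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8));
            return builder.parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed input ends up here as a SAXException.
            return null;
        }
    }

    public static void main(String[] args) {
        // Every tag is closed, so this is also well-formed XML.
        String wellFormed = "<html><head><title>Hi</title></head><body><p>Hello</p></body></html>";
        // The unclosed <br> is fine for browsers but not for an XML parser.
        String tagSoup = "<html><body><p>Hello<br></body></html>";

        Document ok = parseOrNull(wellFormed);
        System.out.println("well-formed parsed: " + (ok != null));
        System.out.println("tag soup parsed: " + (parseOrNull(tagSoup) != null));
    }
}
```

The XML parser may also print its own error message to standard error for the rejected input. For real-world pages, a lenient HTML parser such as the ones suggested in the other answers is usually the more practical route.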
