This is actually quite hard to do, because arbitrary HTML pages are often malformed (mainstream browsers are very tolerant of that). You could look at the Swing HTML parser, which I have never tried, but it looks like the best option. You could also try something along these lines and handle any parsing exceptions that occur (although I have only tried this with XML):
    import java.io.File;
    import java.io.IOException;
    import org.w3c.dom.Document;
    import org.w3c.dom.*;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.ParserConfigurationException;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;
    ...
    try {
        DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        // parse() throws SAXException if the input is not well-formed XML
        Document doc = docBuilder.parse(InputStreamYouBuiltEarlierFromAnHTTPRequest);
    } catch (ParserConfigurationException e) {
        ...
    } catch (SAXException e) {
        ...
    } catch (IOException e) {
        ...
    }
    ...
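For the Swing route, here is a minimal sketch of how it might look. I have not used it in anger, so treat it as a starting point only. The class name `SwingHtmlSketch`, the `Result` holder and the sample input are made up for illustration; `ParserDelegator` and `HTMLEditorKit.ParserCallback` are the actual JDK classes. It just collects the start tags and text it sees, and it should cope with markup that an XML parser would reject (the `<b>` below is never closed).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name and its members are illustrative only.
final class SwingHtmlSketch {

    /** Holds what one parse produced: start tag names and text fragments. */
    static final class Result {
        final List<String> startTags = new ArrayList<>();
        final StringBuilder text = new StringBuilder();
    }

    static Result parse(Reader html) throws IOException {
        Result result = new Result();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Also called for tags the parser implies (e.g. html, body).
                result.startTags.add(tag.toString());
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Separate fragments with a space so they do not run together.
                result.text.append(data).append(' ');
            }
        };
        // true = ignore charset declarations inside the document,
        // because the Reader has already decoded the characters.
        new ParserDelegator().parse(html, callback, true);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Deliberately malformed: <b> is never closed.
        Result r = parse(new StringReader("<p>Hello <b>world</p>"));
        System.out.println(r.startTags);
        System.out.println(r.text);
    }
}
```

In a real program you would pass an `InputStreamReader` wrapped around the HTTP response stream instead of the `StringReader`. The callback also has `handleEndTag`, `handleSimpleTag` and `handleError` methods you can override if you need them.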