How to parse XHTML while ignoring the DOCTYPE declaration using a DOM parser

I ran into a problem when parsing XHTML that has a DOCTYPE declaration with a DOM parser.

Error: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd%20

Declaration: DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" " http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

Is there a way to parse XHTML into a Document object while ignoring the DOCTYPE declaration?

4 answers

The solution that works for me is to give the DocumentBuilder a dummy EntityResolver that returns an empty stream. There is a good explanation here (see the last post, from kdgregory):

http://forums.sun.com/thread.jspa?threadID=5362097

Here is kdgregory's solution:

    documentBuilder.setEntityResolver(new EntityResolver() {
        public InputSource resolveEntity(String publicId, String systemId)
                throws SAXException, IOException {
            return new InputSource(new StringReader(""));
        }
    });
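
For anyone who wants to try the empty-resolver approach end to end, here is a minimal sketch. The class name EmptyResolverDemo and the sample XHTML string are made up for illustration; only the setEntityResolver call comes from the answer above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EmptyResolverDemo {
    // Hypothetical sample input: a small XHTML page with the usual DOCTYPE.
    static final String XHTML =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<head><title>Demo</title></head><body><p>Hello</p></body></html>";

    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Answer every external entity request (including the DTD) with an
        // empty stream, so the parser never opens a network connection.
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XHTML);
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "html"
    }
}
```

One thing to keep in mind: because the DTD is replaced by nothing, named entities that the real DTD declares (such as &nbsp;) are no longer declared, so documents that use them may not parse cleanly this way.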

A parser is required to load the DTD, but you can get around that by setting the standalone attribute in the <?xml ... ?> declaration.

Note, however, that this particular error is most likely caused by confusing XML Schema definition URLs with DTD URLs. See http://www.w3schools.com/xhtml/xhtml_dtd.asp for details. The correct declaration is:

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 

The easiest way is to turn validation off by calling setValidating(false) on your DocumentBuilderFactory. If you do want validation, download the DTD and use a local copy. As Rachel commented above, this is discussed at the W3C.

In short, since the default DocumentBuilderFactory downloads the DTD every time it validates, the W3C was getting hit every time a typical programmer tried to parse an XHTML file in Java. They cannot afford that much traffic, so they respond with an error.
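
The "use a local copy" suggestion can be sketched with an EntityResolver that matches the DOCTYPE's public identifier and serves the DTD from local storage. Everything named here (LocalDtdDemo, localResolver, LOCAL_DTD) is hypothetical, and the one-line LOCAL_DTD string is only a stand-in so the sketch runs on its own; in real use you would open your downloaded xhtml1-transitional.dtd instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalDtdDemo {
    static final String PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN";

    // Stand-in for the downloaded DTD (assumption for this sketch): it only
    // declares &nbsp;. Replace with a stream over your local DTD file.
    static final String LOCAL_DTD = "<!ENTITY nbsp \"&#160;\">";

    static EntityResolver localResolver() {
        return (publicId, systemId) -> {
            if (PUBLIC_ID.equals(publicId)) {
                InputSource src = new InputSource(new StringReader(LOCAL_DTD));
                src.setSystemId(systemId); // keep the original system ID as the base
                return src;
            }
            return null; // anything else: let the parser resolve it as usual
        };
    }

    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(localResolver());
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
            "<!DOCTYPE html PUBLIC \"" + PUBLIC_ID + "\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html><body><p>a&nbsp;b</p></body></html>";
        Document doc = parse(xhtml);
        // The entity comes from the local DTD, not from www.w3.org.
        System.out.println(doc.getElementsByTagName("p").item(0).getTextContent());
    }
}
```

Note that the real XHTML 1.0 DTD pulls in three separate entity files (xhtml-lat1.ent, xhtml-symbol.ent, xhtml-special.ent), so a resolver for the real thing would need local copies of those as well.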


Instead of a fake resolver, the following code snippet instructs the parser to actually ignore the external DTD named in the DOCTYPE declaration:

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    (...)
    DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
    f.setValidating(false);
    // Xerces feature: do not fetch the external DTD named in the DOCTYPE
    f.setAttribute("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    DocumentBuilder builder = f.newDocumentBuilder();
    Document document = builder.parse( ... )
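
A self-contained variant of the same idea, in case it helps. It uses setFeature rather than setAttribute, which is a substitution on my part and not what the answer above uses; the class name NoExternalDtdDemo and the sample input are also made up. It assumes a Xerces-based parser, such as the one bundled with the JDK, since the feature URI is Xerces-specific.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NoExternalDtdDemo {
    // Xerces-specific feature URI, the same one used in the answer above.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setValidating(false);
        // Throws ParserConfigurationException if the underlying parser
        // does not recognise the feature.
        f.setFeature(LOAD_EXTERNAL_DTD, false);
        return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input with the standard XHTML DOCTYPE.
        String xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>Hello</p></body></html>";
        Document doc = parse(xhtml);
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "html"
    }
}
```

With this approach no resolver is needed at all, so there is nothing to keep in sync if other external entities turn up in the document.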

Source: https://habr.com/ru/post/1306970/