How can I create HTML org.w3c.dom.Document?

Question

The Javadoc for the Document interface describes it as:

The Document interface represents an entire HTML or XML document.

javax.xml.parsers.DocumentBuilder builds XML Documents. However, I cannot find a way to create a Document that is an HTML Document!

I want an HTML Document because I'm trying to create a document that I then pass to a library that expects an HTML Document. This library uses Document#getElementsByTagName(String tagname) in a case-insensitive manner, which is fine for HTML but not for XML.

I looked around and found nothing. Questions like "How to convert the source of an HTML page in org.w3c.dom.Document in java?" do not really have an answer.

java dom html xml

Dmitry Minkovsky Mar 13 '15 at 21:00

1 answer

dbank · Accepted Answer · 2015-03-16T07:28:32+0000

You seem to have two explicit requirements:

You need to represent HTML as an org.w3c.dom.Document.
You need Document#getElementsByTagName(String tagname) to work case-insensitively.

If you are trying to work with HTML using org.w3c.dom.Document, I assume you are working with some flavor of XHTML, because an XML API such as the DOM expects well-formed XML. HTML is not necessarily well-formed XML, but XHTML is. Even if you work with plain HTML, you will need to do some preprocessing to make sure it is well-formed XML before running it through an XML parser. It may be easier to parse the HTML first with an HTML parser such as jsoup and then build the org.w3c.dom.Document by walking the tree the HTML parser produced (org.jsoup.nodes.Document in the case of jsoup).
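To make the well-formedness point concrete, here is a minimal sketch that uses only the JDK's built-in XML parser. The class name WellFormedCheck and the two sample strings are made up for illustration; the only difference between the strings is an unclosed br element.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original answer.
class WellFormedCheck {

    /** Returns true if the markup can be parsed as XML by the JDK's parser. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the markup is not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><p>line one<br>line two</p></body></html>";
        String xhtml = "<html><body><p>line one<br/>line two</p></body></html>";
        System.out.println("HTML with unclosed <br>: " + parsesAsXml(html));
        System.out.println("XHTML with <br/>: " + parsesAsXml(xhtml));
    }
}
```

The first string is perfectly ordinary HTML, yet the XML parser rejects it, which is why some preprocessing (or a real HTML parser) is needed before a DOM can be built.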

There is an org.w3c.dom.html.HTMLDocument interface, which extends org.w3c.dom.Document. The only implementation I found was in Xerces-J (2.11.0), as org.apache.html.dom.HTMLDocumentImpl. At first this seems promising, but on closer inspection there are some problems.

1. There is no clear, "clean" way to get an instance of an object that implements the org.w3c.dom.html.HTMLDocument interface.

With Xerces, we usually get a Document object using a DocumentBuilder as follows:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.newDocument();
    //or doc = builder.parse(xmlFile) if parsing from a file

Or using the DOMImplementation variety:

    DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
    DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
    LSParser lsParser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
    Document document = lsParser.parseURI("myFile.xml");

In both cases, we use only the org.w3c.dom.* interfaces to get the Document object.
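As a small illustration of why a Document obtained this way does not meet the second requirement, the following sketch parses a tiny XHTML string with the standard DocumentBuilder and looks up the same tag in two spellings. The class name CaseSensitiveLookup and the sample markup are assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original answer.
class CaseSensitiveLookup {

    static final String XHTML =
        "<html><head><title>My Title</title></head>"
        + "<body><p>one</p><p>two</p></body></html>";

    /** Parses the sample with the standard JAXP parser and counts elements named tagName. */
    static int count(String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XHTML)));
            return doc.getElementsByTagName(tagName).getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("p: " + count("p")); // matches both paragraphs
        System.out.println("P: " + count("P")); // matches nothing: XML names are case-sensitive
    }
}
```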

The closest equivalent found for HTMLDocument was something like this:

    HTMLDOMImplementation htmlDocImpl = HTMLDOMImplementationImpl.getHTMLDOMImplementation();
    HTMLDocument htmlDoc = htmlDocImpl.createHTMLDocument("My Title");

This requires us to directly instantiate internal implementation classes, which makes our code dependent on Xerces.

(Note: I also saw that Xerces has an internal HTMLBuilder (which implements the deprecated DocumentHandler) that can supposedly generate an HTMLDocument using a SAX parser, but I didn't bother looking into it.)

2. org.w3c.dom.html.HTMLDocument does not produce proper XHTML.

Although you can search the HTMLDocument tree case-insensitively with getElementsByTagName(String tagname), all element names are stored internally in ALL CAPS. But XHTML element and attribute names must be in lowercase. (This could be handled by walking the entire document tree and using the Document renameNode() method to change all element names to lowercase.)
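The renameNode() walk mentioned in parentheses could look roughly like the sketch below. It runs against a plain JDK Document with upper-case element names as a stand-in for an HTMLDocument, since I have not tried it on the Xerces HTMLDocumentImpl itself. The class name LowercaseElementNames is hypothetical, and the code assumes unprefixed element names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of the original answer.
class LowercaseElementNames {

    /** Recursively renames node (if it is an element) and all element descendants to lowercase. */
    static void lowercase(Document doc, Node node) {
        Node current = node;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String lower = node.getNodeName().toLowerCase(Locale.ROOT);
            // renameNode may return a replacement node, so continue with its result.
            current = doc.renameNode(node, node.getNamespaceURI(), lower);
        }
        // Snapshot the children first so that renaming cannot disturb the iteration.
        List<Node> children = new ArrayList<>();
        for (Node c = current.getFirstChild(); c != null; c = c.getNextSibling()) {
            children.add(c);
        }
        for (Node c : children) {
            lowercase(doc, c);
        }
    }

    /** Builds a small upper-case tree (made-up sample data) and lowercases it. */
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("HTML");
            Element body = doc.createElement("BODY");
            doc.appendChild(html);
            html.appendChild(body);
            body.appendChild(doc.createElement("P"));
            body.appendChild(doc.createElement("P"));
            lowercase(doc, doc);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = sample();
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getElementsByTagName("p").getLength());
    }
}
```

Attribute names are left untouched here; they would need a similar pass over each element's attribute map.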

In addition, an XHTML document is supposed to have a proper DOCTYPE declaration and an xmlns declaration for the XHTML namespace. There seems to be no straightforward way to set those on an HTMLDocument (unless you fiddle with the internal Xerces implementation).
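For comparison, on a plain org.w3c.dom.Document both the DOCTYPE and the namespace can be set through the standard DOMImplementation interface, without touching any implementation classes. This is a sketch for the plain Document case only, not a fix for HTMLDocument; the class name XhtmlSkeleton is made up.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical helper class, not part of the original answer.
class XhtmlSkeleton {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Strict//EN";
    static final String SYSTEM_ID = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd";

    /** Creates an empty XHTML document with a DOCTYPE and a namespaced html root element. */
    static Document create() {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .getDOMImplementation();
            DocumentType docType = impl.createDocumentType("html", PUBLIC_ID, SYSTEM_ID);
            // createDocument builds the document together with its root element.
            return impl.createDocument(XHTML_NS, "html", docType);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = create();
        System.out.println(doc.getDoctype().getPublicId());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
    }
}
```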

3. org.w3c.dom.html.HTMLDocument has little documentation, and the Xerces implementation of the interface seems incomplete.

I did not scour the entire Internet, but the only documentation I found for HTMLDocument was the previously linked JavaDocs and the comments in the source code of the internal Xerces implementation. In those comments, I also found notes saying that several parts of the interface were not implemented. (Side note: I really got the impression that the org.w3c.dom.html.HTMLDocument interface itself is not really used by anyone and is perhaps incomplete.)

For these reasons, I think it's best to avoid org.w3c.dom.html.HTMLDocument and just do what we can with org.w3c.dom.Document. What can we do?

Well, one approach is to extend org.apache.xerces.dom.DocumentImpl (which extends org.apache.xerces.dom.CoreDocumentImpl, which implements org.w3c.dom.Document). This approach does not require much code, but it still makes us dependent on Xerces because we extend DocumentImpl. In our MyHTMLDocumentImpl, we simply convert all tag names to lowercase when creating and searching for elements. This allows Document#getElementsByTagName(String tagname) to be used in a case-insensitive manner.

MyHTMLDocumentImpl :

    import org.apache.xerces.dom.DocumentImpl;
    import org.apache.xerces.dom.DocumentTypeImpl;
    import org.w3c.dom.DOMException;
    import org.w3c.dom.Document;
    import org.w3c.dom.DocumentType;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    //a base class somewhere in the hierarchy implements org.w3c.dom.Document
    public class MyHTMLDocumentImpl extends DocumentImpl {

        private static final long serialVersionUID = 1658286253541962623L;

        /**
         * Creates a Document with basic elements required to meet
         * the <a href="http://www.w3.org/TR/xhtml1/#strict">XHTML standards</a>.
         * <pre>
         * {@code
         * <?xml version="1.0" encoding="UTF-8"?>
         * <!DOCTYPE html
         *     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
         *     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
         * <html xmlns="http://www.w3.org/1999/xhtml">
         *     <head>
         *         <title>My Title</title>
         *     </head>
         *     <body/>
         * </html>
         * }
         * </pre>
         *
         * @param title desired text content for title tag. If null, no text will be added.
         * @return basic HTML Document.
         */
        public static Document makeBasicHtmlDoc(String title) {
            Document htmlDoc = new MyHTMLDocumentImpl();
            DocumentType docType = new DocumentTypeImpl(null, "html",
                    "-//W3C//DTD XHTML 1.0 Strict//EN",
                    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
            htmlDoc.appendChild(docType);
            Element htmlElement = htmlDoc.createElementNS("http://www.w3.org/1999/xhtml", "html");
            htmlDoc.appendChild(htmlElement);
            Element headElement = htmlDoc.createElement("head");
            htmlElement.appendChild(headElement);
            Element titleElement = htmlDoc.createElement("title");
            if(title != null)
                titleElement.setTextContent(title);
            headElement.appendChild(titleElement);
            Element bodyElement = htmlDoc.createElement("body");
            htmlElement.appendChild(bodyElement);

            return htmlDoc;
        }

        /**
         * This method will allow us to create our
         * MyHTMLDocumentImpl from an existing Document.
         */
        public static Document createFrom(Document doc) {
            Document htmlDoc = new MyHTMLDocumentImpl();
            DocumentType originDocType = doc.getDoctype();
            if(originDocType != null) {
                DocumentType docType = new DocumentTypeImpl(null, originDocType.getName(),
                        originDocType.getPublicId(),
                        originDocType.getSystemId());
                htmlDoc.appendChild(docType);
            }
            Node docElement = doc.getDocumentElement();
            if(docElement != null) {
                Node copiedDocElement = docElement.cloneNode(true);
                htmlDoc.adoptNode(copiedDocElement);
                htmlDoc.appendChild(copiedDocElement);
            }
            return htmlDoc;
        }

        private MyHTMLDocumentImpl() {
            super();
        }

        @Override
        public Element createElement(String tagName) throws DOMException {
            return super.createElement(tagName.toLowerCase());
        }

        @Override
        public Element createElementNS(String namespaceURI, String qualifiedName) throws DOMException {
            return super.createElementNS(namespaceURI, qualifiedName.toLowerCase());
        }

        @Override
        public NodeList getElementsByTagName(String tagname) {
            return super.getElementsByTagName(tagname.toLowerCase());
        }

        @Override
        public NodeList getElementsByTagNameNS(String namespaceURI, String localName) {
            return super.getElementsByTagNameNS(namespaceURI, localName.toLowerCase());
        }

        @Override
        public Node renameNode(Node n, String namespaceURI, String qualifiedName) throws DOMException {
            return super.renameNode(n, namespaceURI, qualifiedName.toLowerCase());
        }
    }

Tester:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    import org.w3c.dom.DOMConfiguration;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;
    import org.w3c.dom.bootstrap.DOMImplementationRegistry;
    import org.w3c.dom.ls.DOMImplementationLS;
    import org.w3c.dom.ls.LSOutput;
    import org.w3c.dom.ls.LSSerializer;

    public class HTMLDocumentTest {

        private final static int P_ELEMENT_NUM = 3;

        public static void main(String[] args)
                //I'm throwing all my exceptions here to shorten the example, but obviously you should handle them appropriately.
                throws ClassNotFoundException, InstantiationException, IllegalAccessException, ClassCastException, IOException {

            Document htmlDoc = MyHTMLDocumentImpl.makeBasicHtmlDoc("My Title");

            //populate the html doc with some example content
            Element bodyElement = (Element) htmlDoc.getElementsByTagName("body").item(0);
            for(int i = 0; i < P_ELEMENT_NUM; ++i) {
                Element pElement = htmlDoc.createElement("p");
                String id = Integer.toString(i+1);
                pElement.setAttribute("id", "anId"+id);
                pElement.setTextContent("Here is some text"+id+".");
                bodyElement.appendChild(pElement);
            }

            //get the title element in a case insensitive manner.
            NodeList titleNodeList = htmlDoc.getElementsByTagName("tItLe");
            for(int i = 0; i < titleNodeList.getLength(); ++i)
                System.out.println(titleNodeList.item(i).getTextContent());

            System.out.println();

            {//get all p elements searching with lowercase
                NodeList pNodeList = htmlDoc.getElementsByTagName("p");
                for(int i = 0; i < pNodeList.getLength(); ++i) {
                    System.out.println(pNodeList.item(i).getTextContent());
                }
            }

            System.out.println();

            {//get all p elements searching with uppercase
                NodeList pNodeList = htmlDoc.getElementsByTagName("P");
                for(int i = 0; i < pNodeList.getLength(); ++i) {
                    System.out.println(pNodeList.item(i).getTextContent());
                }
            }

            System.out.println();

            //to serialize
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS domImplLS = (DOMImplementationLS) registry.getDOMImplementation("LS");

            LSSerializer lsSerializer = domImplLS.createLSSerializer();
            DOMConfiguration domConfig = lsSerializer.getDomConfig();
            domConfig.setParameter("format-pretty-print", true); //if you want it pretty and indented

            LSOutput lsOutput = domImplLS.createLSOutput();
            lsOutput.setEncoding("UTF-8");

            //to write to file
            try (OutputStream os = new FileOutputStream(new File("myFile.html"))) {
                lsOutput.setByteStream(os);
                lsSerializer.write(htmlDoc, lsOutput);
            }

            //to print to screen
            System.out.println(lsSerializer.writeToString(htmlDoc));
        }
    }

Output:

    My Title

    Here is some text1.
    Here is some text2.
    Here is some text3.

    Here is some text1.
    Here is some text2.
    Here is some text3.

    <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <title>My Title</title>
        </head>
        <body>
            <p id="anId1">Here is some text1.</p>
            <p id="anId2">Here is some text2.</p>
            <p id="anId3">Here is some text3.</p>
        </body>
    </html>

Another approach, similar to the one above, is to create a Document wrapper that wraps a Document object and implements the Document interface itself. This requires more code than the "extend DocumentImpl" approach, but it is "cleaner" because we do not need to care about the specific Document implementation. The additional code is not difficult; it is just a bit tedious to provide pass-through implementations for all of the Document methods. I have not fully worked this out yet, and there may be some problems, but if it works, this is the general idea:

    public class MyHTMLDocumentWrapper implements Document {

        private Document doc;

        public MyHTMLDocumentWrapper(Document doc) {
            //...
            this.doc = doc;
            //...
        }

        //...
    }
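One way to avoid writing all the pass-through methods by hand is a dynamic proxy from java.lang.reflect. This swaps a generated proxy in for the hand-written wrapper class above, so treat it as an alternative sketch rather than the wrapper itself. The class name LowercasingDocumentProxy is made up, and only createElement and getElementsByTagName are intercepted here.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of the original answer.
class LowercasingDocumentProxy {

    /** Returns a Document that lowercases tag names before delegating to the wrapped Document. */
    static Document wrap(Document delegate) {
        InvocationHandler handler = (proxy, method, args) -> {
            String name = method.getName();
            boolean takesTagName = name.equals("createElement") || name.equals("getElementsByTagName");
            if (takesTagName && args != null && args.length == 1 && args[0] instanceof String) {
                args[0] = ((String) args[0]).toLowerCase(Locale.ROOT);
            }
            try {
                // Every other call is passed through unchanged.
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the delegate's own exception, e.g. DOMException
            }
        };
        return (Document) Proxy.newProxyInstance(
            LowercasingDocumentProxy.class.getClassLoader(),
            new Class<?>[] {Document.class},
            handler);
    }

    /** Builds a tiny document through the proxy using mixed-case tag names (made-up sample data). */
    static Document sample() {
        try {
            Document doc = wrap(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
            Element html = doc.createElement("HTML"); // stored as "html"
            doc.appendChild(html);
            html.appendChild(doc.createElement("Body")); // stored as "body"
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = sample();
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getElementsByTagName("BODY").getLength());
    }
}
```

The same caveats apply as for a hand-written wrapper: nodes created through the proxy still report the wrapped Document from getOwnerDocument(), and implementation code that casts a Document argument to its own class (a serializer, for instance) may reject the proxy, so unwrap it before passing it to such code.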

Whether you go with org.w3c.dom.html.HTMLDocument, one of the approaches mentioned above, or something else, maybe these suggestions will help you figure out how to proceed.

Edit:

In my parsing tests, when I try to parse the following XHTML file, Xerces hangs in the entity management class while trying to open an HTTP connection. I do not know why, especially since I tested with a local HTML file without any entities. (Maybe it is related to the DOCTYPE or the namespace?) This is the document:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <title>My Title</title>
        </head>
        <body>
            <p id="anId1">Here is some text1.</p>
            <p id="anId2">Here is some text2.</p>
            <p id="anId3">Here is some text3.</p>
        </body>
    </html>
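A likely explanation, though not one confirmed by the test described above, is that the parser tries to download the DTD named in the DOCTYPE from www.w3.org, and that server is known to throttle or delay such requests. Assuming this is the cause, the following sketch uses the standard EntityResolver hook to answer every external entity request with empty content, so no connection is opened. The class name OfflineXhtmlParse is made up, and the sample is a shortened version of the document above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original answer.
class OfflineXhtmlParse {

    static final String XHTML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
        + "    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<head><title>My Title</title></head><body/></html>";

    /** Parses XHTML without fetching the external DTD referenced by its DOCTYPE. */
    static Document parse(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Resolve every external entity (here: the DTD) to empty content
            // instead of letting the parser open the system identifier URL.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XHTML);
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```

The trade-off is that named entities defined only in the DTD, such as &nbsp;, are no longer declared, so a document that uses them will fail to parse. Returning a local copy of the DTD from the resolver avoids that.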
