Parsing Invalid / Incomplete XML Files

Question

I have a process that parses an XML file using JDOM and XPath, as shown below:

    private static SAXBuilder builder = null;
    private static Document doc = null;
    private static XPath xpathInstance = null;

    builder = new SAXBuilder();
    Text list = null;

    try {
        doc = builder.build(new StringReader(xmldocument));
    } catch (JDOMException e) {
        throw new Exception(e);
    }

    try {
        xpathInstance = XPath.newInstance("//book[author='Neal Stephenson']/title/text()");
        list = (Text) xpathInstance.selectSingleNode(doc);
    } catch (JDOMException e) {
        throw new Exception(e);
    }

The above works fine. The XPath expressions are stored in a properties file so that they can be modified at any time. Now I need to process some more XML files that come from a legacy system, which only sends XML files in chunks of 4000 bytes. The existing processing reads the 4000-byte chunks and stores them in an Oracle database, one chunk per row (changing the legacy system, or the processing that stores the chunks as rows in the database, is out of the question).

I can create a complete, valid XML document by extracting all the rows associated with a particular XML document and merging them, and then use the existing processing (shown above) to parse it.

The thing is, the data I need to extract from an XML document will always be in the first 4000 bytes. This chunk is not a valid XML document because it is incomplete, but it contains all the data I need. I can't parse just that one chunk, because the JDOM builder will reject it.

I am wondering if I can parse the malformed XML fragment without having to combine all the parts (of which there could be quite a lot) to get a valid XML document. This would save me several trips to the database to check whether a chunk is available, and I wouldn't have to merge 100 chunks just to use the first 4000 bytes.

I know that I could probably use Java string functions to retrieve the relevant data, but is this possible with a parser or even XPath? Or do they both expect the XML document to be well-formed before they can parse it?

java xml parsing xpath jdom

ziggy Aug 08 '11 at 12:21

1 answer

Vlad · Accepted Answer · 2011-08-08T12:27:39+0000

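You can try using jsoup to parse the malformed XML. It is a lenient HTML parser, so it tolerates unclosed tags. By definition, XML must be well-formed; anything that is not well-formed is not really XML, and a conforming XML parser is required to reject it.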

UPDATE - example:

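    import org.jsoup.nodes.Attribute;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.parser.Parser;
    import org.jsoup.parser.Tag;

    public static void main(String[] args) {
        // Parse the unterminated fragment in the context of a dummy <p> element.
        for (Node node : Parser.parseFragment(
                "<test><author name=\"Vlad\"><book name=\"SO\"/>",
                new Element(Tag.valueOf("p"), ""), "")) {
            print(node, 0);
        }
    }

    // Print the node name and its attributes, then recurse into the children
    // with four more spaces of indentation.
    public static void print(Node node, int offset) {
        for (int i = 0; i < offset; i++) {
            System.out.print(" ");
        }
        System.out.print(node.nodeName());
        for (Attribute attribute : node.attributes()) {
            System.out.print(", ");
            System.out.print(attribute.getKey() + "=" + attribute.getValue());
        }
        System.out.println();
        for (Node child : node.childNodes()) {
            print(child, offset + 4);
        }
    }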
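If adding a dependency is a problem, a streaming parser may also be worth a try. The sketch below is only an illustration under some assumptions: it uses the JDK's StAX API (`javax.xml.stream`), it takes the element names `book`, `author` and `title` from the XPath expression in the question, and it assumes `author` and `title` contain plain text only. The class name `TruncatedXmlScan` and the method `findTitle` are made up for the example. The idea is that a pull parser hands over events one at a time, so everything before the cut-off point can be read, and the exception raised at the truncation is simply treated as the end of the input.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of JDOM or jsoup.
class TruncatedXmlScan {

    // Returns the title of the first book by wantedAuthor, or null if no
    // such book is seen before the fragment breaks off.
    static String findTitle(String fragment, String wantedAuthor) {
        String author = null;
        String title = null;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(fragment));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if ("book".equals(name)) {
                    // A new book starts: forget the previous book's values.
                    author = null;
                    title = null;
                } else if ("author".equals(name)) {
                    author = reader.getElementText();
                } else if ("title".equals(name)) {
                    title = reader.getElementText();
                }
                if (wantedAuthor.equals(author) && title != null) {
                    return title;
                }
            }
        } catch (XMLStreamException e) {
            // Expected for a truncated chunk: the parser reports the
            // premature end of input, which is treated here as "no more data".
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed sample data: a chunk cut off in the middle of the second book.
        String chunk = "<library><book><author>Neal Stephenson</author>"
                + "<title>Snow Crash</title></book><book><author>Someone El";
        System.out.println(findTitle(chunk, "Neal Stephenson")); // prints "Snow Crash"
    }
}
```

This gives up the XPath expressions from the properties file, so it only fits if the lookups are simple enough to express as element-name checks. Note also that a chunk cut at a byte boundary can end in the middle of a multi-byte character, so the bytes need to be decoded with care before they are handed to the reader.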
