Efficiently merge several large XML files into one

I searched the web and Stack Overflow up and down and found no solution, although I did find ways to do this in pure XSLT.

The problem is that the resulting XML will be several hundred megabytes in size, so I have to do it with SAX in Java. (Please, no XSLT solutions, even though I tagged the question with xslt ;-))

Let me explain in more detail. I have several XML files (preferably as InputStreams) that need to be parsed. The files or InputStreams look like this:

inputstream1

    <root>
      <doc> <tag>test1</tag> </doc>
      <doc> <tag>test2</tag> </doc>
      ...
    </root>

inputstream2

    <root>
      <doc> <tag>test3</tag> </doc>
      <doc> <tag>test4</tag> </doc>
      ...
    </root>

inputstream1 + inputstream2 + ... + inputstreamN = XML result, which should look like this:

    <root>
      <doc> <tag>test1</tag> </doc>
      <doc> <tag>test2</tag> </doc>
      ...
      <doc> <tag>test3</tag> </doc>
      <doc> <tag>test4</tag> </doc>
      ...
    </root>

Does anyone have a solution or a link for this? Can it be implemented with a custom InputSource or a custom ContentHandler? Or is it possible with Joost / STX?

The advantage of a ContentHandler is that I could also apply some minor transformations (I have already implemented this). The problem is that I do not know how to pass multiple files or InputStreams as a single InputSource:

    XMLReader xmlReader = XMLReaderFactory.createXMLReader();
    xmlReader.setContentHandler(customHandler);
    xmlReader.parse(getInputSource()); // only one InputStream possible

Or should I parse the InputStreams directly in my ContentHandler?

4 answers

I finally solved this with the following snippet:

    // The merged output is written to System.out
    StreamResult finalHandler = new StreamResult(new OutputStreamWriter(System.out));

    // customHandler extends DefaultHandler
    CustomTransformerHandler customHandler = new CustomTransformerHandler(finalHandler);

    // Start the output document once, before the first input is parsed
    customHandler.startDocumentExplicitly();

    InputStream is = null;
    while ((is = customHandler.createNextInputStream()) != null) {
        // multiple inputStream parsing
        XMLReader myReader = XMLReaderFactory.createXMLReader();
        myReader.setContentHandler(customHandler);
        myReader.parse(new InputSource(is));
    }

    // End the output document once, after the last input is parsed
    customHandler.endDocumentExplicitly();

The important part was to leave the startDocument and endDocument methods empty. All other methods (characters, startElement, endElement) are forwarded to finalHandler. The customHandler.createNextInputStream method returns null once all input streams have been read.
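For readers who want something they can run, here is a minimal sketch of the same idea using only JDK classes. It is not the CustomTransformerHandler from the snippet above: the names SaxMerge, MergingHandler, begin, end and merge are hypothetical, and the sketch assumes that every input shares one root element whose start and end tags should be dropped and replaced by a single wrapper.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class SaxMerge {

    /** Forwards SAX events to a serializer, dropping per-file document and root events. */
    static class MergingHandler extends DefaultHandler {
        private final TransformerHandler out;
        private final String rootName; // assumption: one shared root element name
        private int depth = 0;

        MergingHandler(TransformerHandler out, String rootName) {
            this.out = out;
            this.rootName = rootName;
        }

        /** Opens the output document and the single wrapper element. */
        void begin() throws SAXException {
            out.startDocument();
            out.startElement("", rootName, rootName, new AttributesImpl());
        }

        /** Closes the wrapper element and the output document. */
        void end() throws SAXException {
            out.endElement("", rootName, rootName);
            out.endDocument();
        }

        // Left empty on purpose: each parsed file must not restart the output.
        @Override public void startDocument() { }
        @Override public void endDocument() { }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            // depth 0 is the per-file root element, which is skipped
            if (depth++ > 0) out.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (--depth > 0) out.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            out.characters(ch, start, length);
        }
    }

    /** Merges the children of each document's root under one new root element. */
    public static String merge(List<String> docs, String rootName) throws Exception {
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler(); // identity transform
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter sink = new StringWriter();
        serializer.setResult(new StreamResult(sink));

        MergingHandler handler = new MergingHandler(serializer, rootName);
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);

        handler.begin();
        for (String doc : docs) {
            XMLReader reader = spf.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(doc)));
        }
        handler.end();
        return sink.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(merge(Arrays.asList(
                "<root><doc><tag>test1</tag></doc></root>",
                "<root><doc><tag>test3</tag></doc></root>"), "root"));
    }
}
```

The sketch uses strings and a StringWriter only to stay self-contained. For inputs of several hundred megabytes you would pass InputStreams to InputSource and give StreamResult an OutputStream, so that nothing is buffered in memory.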


I haven't done this myself, but I remember seeing an IBM developerWorks article that made it look pretty easy.

Now this is a bit outdated, but try http://www.ibm.com/developerworks/xml/library/x-tipstx5/index.html

It uses StAX instead of SAX. StAX (the javax.xml.stream package) has been part of the JDK since Java 6; for older JDKs you can get it from http://stax.codehaus.org/
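To give an idea of what a StAX-based merge can look like, here is a rough sketch built on the JDK's javax.xml.stream event API. It is not the code from the linked article. The class name StaxMerge, the merge method and the rootName parameter are hypothetical, and it assumes that each input has a single root element whose children should be copied under one new root.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

class StaxMerge {

    /** Copies the children of every input's root element under one new root. */
    public static String merge(List<String> docs, String rootName) throws Exception {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        XMLEventFactory eventFactory = XMLEventFactory.newInstance();
        StringWriter sink = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(sink);

        // Write the document start and the single wrapper element once.
        writer.add(eventFactory.createStartDocument());
        writer.add(eventFactory.createStartElement("", "", rootName));

        for (String doc : docs) {
            XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(doc));
            int depth = 0; // 0 = outside the input's root element
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    if (depth++ == 0) continue;   // skip the input's own root start tag
                } else if (event.isEndElement()) {
                    if (--depth == 0) continue;   // skip the input's own root end tag
                } else if (depth == 0) {
                    continue;                     // skip prolog and epilog events
                }
                writer.add(event);                // copy everything inside the root
            }
            reader.close();
        }

        writer.add(eventFactory.createEndElement("", "", rootName));
        writer.add(eventFactory.createEndDocument());
        writer.close();
        return sink.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(merge(Arrays.asList(
                "<root><doc><tag>test1</tag></doc></root>",
                "<root><doc><tag>test3</tag></doc></root>"), "root"));
    }
}
```

Because events are pulled and written one at a time, memory use should stay roughly constant regardless of input size once the strings are replaced by real input and output streams.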


You might want to take a look at the streaming version of Saxon. It can process XSLT on the fly without requiring a full DOM in memory.


AFAIK, the most efficient way to merge the files is the byte-level cut-and-paste offered by VTD-XML. You parse both files into VTDNav objects, create an XMLModifier instance, grab the content fragment from the second file, and paste it into the first file, which should be much more efficient than SAX. The resulting XML is also written directly to a file, so there is no need to hold it in memory. Below is the complete code in fewer than 20 lines:

    import com.ximpleware.*;
    import java.io.*;

    public class merge {
        // merge second.xml into first.xml assuming the same encoding
        public static void main(String[] s) throws VTDException, IOException {
            VTDGen vg = new VTDGen();
            if (!vg.parseFile("d:\\xml\\first.xml", false))
                return;
            VTDNav vn1 = vg.getNav();
            if (!vg.parseFile("d:\\xml\\second.xml", false))
                return;
            VTDNav vn2 = vg.getNav();
            XMLModifier xm = new XMLModifier(vn1);
            long l = vn2.getContentFragment();
            xm.insertBeforeTail(vn2, l);
            xm.output("d:\\xml\\merged.xml");
        }
    }

Source: https://habr.com/ru/post/1301484/

