Parsing very large XML files and mapping them to Java objects

I have the following problem: I have very large XML files (e.g. 300+ MB) and I need to parse them in order to add some of their values to a database. The structure of these files is also very complex. I want to use a StAX parser, since it offers a good way to parse (and therefore process) only parts of the XML file at a time, and thus not load the whole thing into memory; on the other hand, extracting values with StAX (at least in these XML files) is cumbersome and I need to write a ton of code. From this last point of view it would help me a lot if I could bind the XML to Java objects (for example with JAXB), but that would immediately load the whole file, plus a ton of object instances, into memory.

My question is: is there a way to read (or partially parse) the file sequentially and unmarshal only those parts into Java objects, so that I can handle them easily without clogging up memory?

+6
3 answers

Well, first of all I want to thank the two people who answered my question, but in the end I did not use either of their suggestions, partly because the proposed technologies are a bit far from, let's say, "standard" Java XML parsing, and it feels wrong to reach for them while a similar tool is already present in Java, and partly because I actually found a solution that uses only the Java API to accomplish this.

I will not describe my solution in detail, because I have already completed the implementation and it is a rather large piece of code (I use Spring Batch on top of everything, with a ton of configuration and so on).

However, I will make a short note about what I finally ended up with:

The big idea here is that if you have an XML document and a matching XSD schema, you can parse and unmarshal it with JAXB, and you can do it in chunks: those chunks can be read with an event-based parser such as StAX and then passed to the JAXB unmarshaller.

In practice this means that you must first decide where there is a good split point in your XML file, where you can say: "this part here contains a lot of repeating structure, so I will process those repetitions one at a time." The repeating parts are usually the same (child) tag, repeated many times inside a parent tag. So all you have to do is make an event listener in your StAX parser that fires at the start of each of these child tags, hand the contents of that child tag over to JAXB, unmarshal it with JAXB, and process it. A minimal sketch of this pattern is shown below.
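For illustration only, here is a minimal sketch of that StAX-plus-JAXB loop. Everything specific in it is a placeholder: the huge-file.xml path, the repeating <item> element, the Item binding class, and the process() hook all stand in for whatever your real schema defines.

```java
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

// Hypothetical JAXB binding for the repeating <item> element
@XmlRootElement(name = "item")
class Item {
    // fields mapped from the child tags would go here
}

public class ChunkedXmlReader {

    public static void main(String[] args) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                xif.createXMLStreamReader(new FileInputStream("huge-file.xml"));

        // Building the JAXBContext is expensive; do it once, outside the loop
        JAXBContext ctx = JAXBContext.newInstance(Item.class);
        Unmarshaller unmarshaller = ctx.createUnmarshaller();

        while (reader.hasNext()) {
            int event = reader.next();
            // The "event listener": react to the start of each repeating child tag
            if (event == XMLStreamConstants.START_ELEMENT
                    && "item".equals(reader.getLocalName())) {
                // unmarshal(reader, type) consumes exactly this one element,
                // leaving the StAX cursor just past its closing tag
                Item item = unmarshaller.unmarshal(reader, Item.class).getValue();
                process(item); // e.g. insert into the database, then discard
            }
        }
        reader.close();
    }

    private static void process(Item item) { /* write to db, etc. */ }
}
```

Only one Item is alive at a time, so memory use stays flat no matter how large the file is.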

In fact, the idea is excellently described in this article, which I followed (it is from 2006, but it covers JDK 1.6, which was quite new at that time, so version-wise it is not that dated):

http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/

+2

I would recommend Eclipse EMF. But it has the same problem: if you give it a file name, it will parse the whole thing. Although there are some ways to reduce how much it loads, I did not worry about that too much, since we run it on machines with 96 GB of RAM. :)

In any case, if your XML format is well defined, one workaround is to fool EMF by breaking the whole file down into several smaller (but still well-defined) XML snippets, and then feeding those snippets in one by one. I don't know JAX-B, but perhaps the same workaround can be applied there as well; that is what I would recommend, because EMF is too big a hammer for such a small problem.

Just to clarify a bit: if your XML looks like this:

<tag1>
    <tag2>
        <tag3/>
        <tag4>
            <tag5/>
        </tag4>
        <tag6/>
        <tag7/>
    </tag2>
    <tag2>
        <tag3/>
        <tag4>
            <tag5/>
        </tag4>
        <tag6/>
        <tag7/>
    </tag2>
    ............
    <tag2>
        <tag3/>
        <tag4>
            <tag5/>
        </tag4>
        <tag6/>
        <tag7/>
    </tag2>
</tag1>

Then it can be split into several small XML documents, each starting with <tag2> and ending with </tag2>. And in Java most parsers accept a stream or reader, so just parse with whatever you want: create a StringReader (or similar) for each <tag2> fragment in a loop and hand it to JAX-B or EMF. A minimal sketch of this last step is shown below.
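As a hedged illustration of feeding one such fragment to JAXB: the Tag2 class below is a hypothetical binding (a real one would map tag3, tag4, and so on), and the fragment string stands in for whatever your splitter cuts out of the big file.

```java
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.StringReader;

// Hypothetical JAXB binding for one repeating <tag2> fragment
@XmlRootElement(name = "tag2")
class Tag2 {
    // fields for tag3, tag4, tag6, tag7 would go here
}

public class FragmentFeeder {

    public static void main(String[] args) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(Tag2.class);
        Unmarshaller unmarshaller = ctx.createUnmarshaller();

        // In real code this string would be cut out of the big file by
        // your splitter; it is hard-coded here to keep the sketch short
        String fragment =
                "<tag2><tag3/><tag4><tag5/></tag4><tag6/><tag7/></tag2>";

        Tag2 chunk = (Tag2) unmarshaller.unmarshal(new StringReader(fragment));
        // process the chunk, then let it be garbage-collected
        System.out.println("Parsed one chunk: " + chunk);
    }
}
```

Because each fragment is a complete little document in its own right, JAXB never sees the 300 MB whole.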

HTH

+5

Document projection may be the answer here. Saxon and a number of other XQuery processors offer it as an option. If you have a fairly simple query that selects a small amount of data from a large document, the query processor analyzes the query to determine which parts of the tree need to be available to the query and which can be discarded during processing. The resulting tree can often be as little as 1% of the size of the full document. Details for Saxon here:

http://saxonica.com/documentation/sourcedocs/projection.xml
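For flavor only, here is a small, selective query of the kind projection pays off for, run through Saxon's s9api interface. The file name, element names, and query text are all hypothetical, and this sketch does not itself switch projection on: projection is a Saxon-EE feature, enabled as described in the linked documentation.

```java
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.XQueryCompiler;
import net.sf.saxon.s9api.XQueryEvaluator;
import net.sf.saxon.s9api.XQueryExecutable;
import net.sf.saxon.s9api.XdmItem;
import net.sf.saxon.s9api.XdmValue;

public class SelectiveQuery {

    public static void main(String[] args) throws Exception {
        // 'true' requests the licensed (EE) processor, the edition
        // that implements document projection
        Processor processor = new Processor(true);
        XQueryCompiler compiler = processor.newXQueryCompiler();

        // A selective query: it touches only //record/name, so with
        // projection enabled the rest of the tree can be discarded
        XQueryExecutable executable = compiler.compile(
                "for $r in doc('huge-file.xml')//record return $r/name");

        XQueryEvaluator evaluator = executable.load();
        XdmValue result = evaluator.evaluate();
        for (XdmItem item : result) {
            System.out.println(item.getStringValue());
        }
    }
}
```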

+1
