Reading a huge XML file using StAX and XPath

The input file is about 10 GB and contains thousands of transactions in XML format. The requirement is to select individual XML transactions based on user input and send them to the processing system.

Example file contents

  <transactions>
    <txn id="1">
      <name>product 1</name>
      <price>29.99</price>
    </txn>
    <txn id="2">
      <name>product 2</name>
      <price>59.59</price>
    </txn>
  </transactions>

It is expected that the user (a technical user) will give the name of the input tag, for example <txn>.

We would like this solution to be more generic. The contents of the file may vary, and users may give an XPath expression, such as "//transactions/txn", to select individual transactions.

There are a few technical points to consider here.

  • The file may be in a shared folder or on an FTP server
  • Since the file is huge, we cannot load the entire file into the JVM

Can I use a StAX parser for this? It should take an XPath expression as input and select the matching XML transactions.

Looking for suggestions. Thanks in advance.

+6
7 answers

StAX and XPath are two different things. StAX lets you parse an XML document as a stream, in the forward direction only, while XPath allows navigation in both directions. StAX is a very fast streaming XML parser, but if you want XPath, Java has a separate library for that.

Take a look at this question for a very similar discussion: Is there any XPath processor for the SAX model?

+8

If performance is an important factor and/or the document size is large (both of which seem to apply here), the difference between an event parser (such as SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM document before evaluating the XPath expression. [It is interesting to note that all Java document object model implementations, such as the DOM or Axiom, use an event processor (such as SAX or StAX) to build the in-memory representation, so if you can ever manage with the event processor alone, you save both the memory and the time needed to build that representation.]

As mentioned, the XPath implementation in the JDK operates on a W3C DOM document. You can see this in the JDK source code by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where before the evaluate() method can be called the source must first be parsed:

  Document document = getParser().parse( source ); 

After that, your 10 GB of XML will be represented in memory (plus whatever overhead), which is probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there does not seem to be a really strong case for XPath here (except perhaps for programming elegance). The same would be true of the XProc suggestion: it would also build a DOM. If you truly need a DOM, you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API, builds its DOM on top of StAX so it is fast, and uses Jaxen for its XPath implementation. Jaxen requires some kind of DOM (W3C DOM, DOM4J or JDOM), and the same is true of all XPath implementations, so if you do not truly need XPath, sticking with the event parser alone is recommended.
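
For contrast, here is a minimal sketch (not from the original answer) of the plain JDK DOM-plus-XPath approach being warned about; the file name is a placeholder, and the parse() call alone would try to materialize the entire 10 GB document in memory:

  import javax.xml.parsers.DocumentBuilderFactory;
  import javax.xml.xpath.XPath;
  import javax.xml.xpath.XPathConstants;
  import javax.xml.xpath.XPathFactory;
  import org.w3c.dom.Document;
  import org.w3c.dom.NodeList;

  public class DomXPathExample {
      public static void main(String[] args) throws Exception {
          // The whole file is parsed into a W3C DOM first; for a 10 GB input
          // this alone would exhaust a typical heap.
          Document doc = DocumentBuilderFactory.newInstance()
                  .newDocumentBuilder()
                  .parse("transactions.xml"); // placeholder file name

          XPath xpath = XPathFactory.newInstance().newXPath();
          NodeList txns = (NodeList) xpath.evaluate("//transactions/txn", doc, XPathConstants.NODESET);
          System.out.println("Matched transactions: " + txns.getLength());
      }
  }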

SAX is the older streaming API; StAX is newer and much faster. Whether you use the StAX implementation built into the JDK (javax.xml.stream) or the Woodstox StAX implementation (which is much faster in my experience), I would recommend creating an XML event filter that first matches on the element type name (to capture your <txn> elements). This will produce small bursts of events (element, attributes, text) that can be checked against your user-supplied selection values. On a suitable match you can either pull the necessary information out of the events, or pipe the bounded events to build a mini-DOM from them if you find the result easier to navigate. But that may well be overkill if the markup is simple.

This is most likely the easiest and fastest approach, and it avoids the memory overhead of building a DOM. If you pass the element and attribute names into the filter (so that the matching algorithm is configurable), you can keep it relatively generic.
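
As a rough illustration of this streaming approach (a sketch, not a production implementation), something along these lines should work with the JDK's javax.xml.stream API; the file name, element name and id-based matching criterion are made-up placeholders, and it assumes the flat markup shown in the question:

  import java.io.FileInputStream;
  import javax.xml.namespace.QName;
  import javax.xml.stream.XMLEventReader;
  import javax.xml.stream.XMLInputFactory;
  import javax.xml.stream.events.Attribute;
  import javax.xml.stream.events.XMLEvent;

  public class TxnStreamReader {

      // elementName and idValue are supplied by the user, keeping the matching configurable.
      public static void selectTransactions(String file, String elementName, String idValue) throws Exception {
          XMLEventReader reader = XMLInputFactory.newInstance()
                  .createXMLEventReader(new FileInputStream(file));
          while (reader.hasNext()) {
              XMLEvent event = reader.nextEvent();
              if (event.isStartElement()
                      && elementName.equals(event.asStartElement().getName().getLocalPart())) {
                  Attribute id = event.asStartElement().getAttributeByName(new QName("id"));
                  boolean wanted = id != null && id.getValue().equals(idValue);
                  // Consume the small burst of events belonging to this element only
                  // (assumes the element does not nest inside itself, as in the sample markup).
                  while (reader.hasNext()) {
                      XMLEvent inner = reader.nextEvent();
                      if (inner.isEndElement()
                              && elementName.equals(inner.asEndElement().getName().getLocalPart())) {
                          break;
                      }
                      if (wanted && inner.isCharacters()) {
                          // A real system would forward the whole fragment to processing;
                          // here we just print the text content of the matching transaction.
                          System.out.print(inner.asCharacters().getData());
                      }
                  }
              }
          }
          reader.close();
      }

      public static void main(String[] args) throws Exception {
          selectTransactions("transactions.xml", "txn", "2"); // placeholder inputs
      }
  }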

+13

This is definitely a use case for XProc with a streaming, parallel-processing implementation such as QuiXProc ( http://code.google.com/p/quixproc ).

In this situation you would use something like:

  <p:for-each>
    <p:iteration-source select="//transactions/txn"/>
    <!-- your processing on a small fragment -->
  </p:for-each>

You can even wrap the resulting sequence of documents back up with a single XProc line:

  <p:wrap-sequence wrapper="transactions"/> 

Hope this helps

+1

We regularly parse 1 GB+ complex XML files using a SAX parser that does exactly what you described: it extracts partial DOM trees that can be conveniently queried using XPath.

I wrote about this here. It uses a SAX rather than a StAX parser, but it may be worth a look.
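
The linked write-up is not reproduced here, but a minimal sketch of the general partial-DOM-tree idea (not the answerer's actual code) could look like this, assuming the <txn> markup from the question and using only the JDK's SAX, DOM and XPath APIs:

  import javax.xml.parsers.DocumentBuilderFactory;
  import javax.xml.parsers.SAXParserFactory;
  import javax.xml.xpath.XPathFactory;
  import org.w3c.dom.Document;
  import org.w3c.dom.Element;
  import org.w3c.dom.Node;
  import org.xml.sax.Attributes;
  import org.xml.sax.SAXException;
  import org.xml.sax.helpers.DefaultHandler;

  public class PartialDomHandler extends DefaultHandler {
      private final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      private final XPathFactory xpf = XPathFactory.newInstance();
      private Document doc;   // mini-DOM for the <txn> currently being read
      private Node current;

      @Override
      public void startElement(String uri, String local, String qName, Attributes attrs) throws SAXException {
          try {
              if (doc == null && "txn".equals(qName)) {
                  doc = dbf.newDocumentBuilder().newDocument(); // start a new partial tree
                  current = doc;
              }
              if (doc != null) {
                  Element e = doc.createElement(qName);
                  for (int i = 0; i < attrs.getLength(); i++) {
                      e.setAttribute(attrs.getQName(i), attrs.getValue(i));
                  }
                  current.appendChild(e);
                  current = e;
              }
          } catch (Exception ex) {
              throw new SAXException(ex);
          }
      }

      @Override
      public void characters(char[] ch, int start, int length) {
          if (doc != null) {
              current.appendChild(doc.createTextNode(new String(ch, start, length)));
          }
      }

      @Override
      public void endElement(String uri, String local, String qName) throws SAXException {
          if (doc == null) return;
          current = current.getParentNode();
          if ("txn".equals(qName)) {
              try {
                  // The partial tree is complete and small, so ordinary XPath is now cheap.
                  String name = xpf.newXPath().evaluate("txn/name", doc);
                  System.out.println("txn name: " + name.trim());
              } catch (Exception ex) {
                  throw new SAXException(ex);
              }
              doc = null; // discard the mini-DOM before the next transaction
          }
      }

      public static void main(String[] args) throws Exception {
          SAXParserFactory.newInstance().newSAXParser()
                  .parse("transactions.xml", new PartialDomHandler()); // placeholder file name
      }
  }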

+1

Streaming Transformations for XML (STX) may be what you need.

0

Do you need to process it quickly, or do you need fast lookups in the data? These requirements call for different approaches.

For reading through all the data quickly, StAX will be fine.

If you need fast search queries, then you may need to load the data into some database, for example Berkeley DB XML.

0

A fun solution for processing huge XML files (> 10 GB):

  • Use ANTLR to create byte offsets for the parts you are interested in. This saves memory compared to a DOM-based approach.
  • Use JAXB to read the details starting from a byte position (see the sketch below).

More information, using the Wikipedia dumps (17 GB) as an example, is in this SO answer: fooobar.com/questions/646543 / ...
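
The ANTLR indexing pass is not shown here, but assuming a prior pass has already produced the byte offset and length of a single <txn> fragment, a rough sketch of the JAXB part (classic javax.xml.bind, bundled with Java 8 or added as a dependency on newer JDKs; class names, file name and offsets are made up) might look like this:

  import java.io.ByteArrayInputStream;
  import java.io.RandomAccessFile;
  import javax.xml.bind.JAXBContext;
  import javax.xml.bind.annotation.XmlAttribute;
  import javax.xml.bind.annotation.XmlElement;
  import javax.xml.bind.annotation.XmlRootElement;

  @XmlRootElement(name = "txn")
  class Txn {
      @XmlAttribute public String id;
      @XmlElement public String name;
      @XmlElement public String price;
  }

  public class OffsetJaxbReader {

      // offset and length describe one <txn>...</txn> fragment, as produced
      // by a prior indexing pass (ANTLR in the suggestion above).
      public static Txn readTxn(String file, long offset, int length) throws Exception {
          byte[] buf = new byte[length];
          try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
              raf.seek(offset);
              raf.readFully(buf);
          }
          return (Txn) JAXBContext.newInstance(Txn.class)
                  .createUnmarshaller()
                  .unmarshal(new ByteArrayInputStream(buf));
      }

      public static void main(String[] args) throws Exception {
          Txn txn = readTxn("transactions.xml", 16L, 80); // placeholder offset and length
          System.out.println(txn.id + " -> " + txn.name + " @ " + txn.price);
      }
  }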

0

Source: https://habr.com/ru/post/896044/

