Iterate a large XML file and copy select nodes

I need to iterate through a large XML file (~ 2 GB) and selectively copy specific nodes to one or more separate XML files.

My first thought is to use XPath to iterate over the matching nodes and, for each node, test which Thing's file (if any) it should be copied to, for example:

    var doc = new XPathDocument(@"C:\Some\Path.xml");
    var nav = doc.CreateNavigator();
    var nodeIter = nav.Select("//NodesOfInterest");
    while (nodeIter.MoveNext())
    {
        foreach (Thing thing in ThingsThatMightGetNodes)
        {
            if (thing.AllowedToHaveNode(nodeIter.Current))
            {
                thing.WorkingXmlDoc.AppendChild(... nodeIter.Current ...);
            }
        }
    }

In this implementation, Thing defines a public System.Xml.XmlDocument WorkingXmlDoc to hold the nodes for which its AllowedToHaveNode() test succeeds. However, I do not understand how to create a new XmlNode that is a copy of nodeIter.Current.

If there is a better approach, I would also be happy to hear it.

3 answers

Evaluating an XPath expression requires the entire XML document (the XML Infoset) to be in RAM.

For an XML file whose textual representation exceeds 2 GB, you typically need more than 10 GB of RAM just to hold the XML document.

So, while it is not impossible, it may be preferable (especially on a server that should keep resources readily available for many requests) to use a different method.

The XmlReader-based classes are a great tool for this scenario. An XmlReader is fast and forward-only, and it does not keep the nodes it has already read in memory. In addition, your logic will remain almost the same.
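
A minimal sketch of that approach, assuming (as in the question) that the interesting elements are all named NodesOfInterest; the Thing stub and the "Root" wrapper element below are placeholders of mine, not part of the original code:

    using System;
    using System.Xml;

    class Thing
    {
        // Stand-ins for the members described in the question.
        public XmlDocument WorkingXmlDoc = new XmlDocument();
        public bool AllowedToHaveNode(XmlNode node) { return true; /* real test here */ }
    }

    class Extractor
    {
        static void Main()
        {
            var things = new[] { new Thing(), new Thing() };
            foreach (var t in things)
                t.WorkingXmlDoc.AppendChild(t.WorkingXmlDoc.CreateElement("Root"));

            using (var reader = XmlReader.Create(@"C:\Some\Path.xml"))
            {
                // Stream forward through the file; only one element of
                // interest is materialized at a time.
                while (reader.ReadToFollowing("NodesOfInterest"))
                {
                    var holder = new XmlDocument();
                    XmlNode node;
                    using (XmlReader subtree = reader.ReadSubtree())
                    {
                        // Builds a detached XmlNode from just this element;
                        // closing the subtree reader repositions the parent
                        // reader at the element's end tag.
                        node = holder.ReadNode(subtree);
                    }

                    foreach (var thing in things)
                    {
                        if (thing.AllowedToHaveNode(node))
                        {
                            // ImportNode makes a copy owned by the target
                            // document, which is also one answer to the
                            // "copy nodeIter.Current" part of the question.
                            thing.WorkingXmlDoc.DocumentElement.AppendChild(
                                thing.WorkingXmlDoc.ImportNode(node, true));
                        }
                    }
                }
            }
        }
    }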


You should consider LINQ to XML. Check out this blog post for details and examples:

http://james.newtonking.com/archive/2007/12/11/linq-to-xml-over-large-documents.aspx
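
The heart of the technique in that post is to combine XmlReader with XNode.ReadFrom so that elements are handed to LINQ to XML one at a time. A rough sketch from memory (the element name is carried over from the question; the method and class names are mine):

    using System;
    using System.Collections.Generic;
    using System.Xml;
    using System.Xml.Linq;

    static class LargeXmlStream
    {
        // Yields matching elements one at a time; the document as a whole
        // is never loaded into memory.
        public static IEnumerable<XElement> StreamElements(string path, string name)
        {
            using (var reader = XmlReader.Create(path))
            {
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == name)
                    {
                        // ReadFrom materializes the current element and advances
                        // the reader past it, so no extra Read() is needed here.
                        yield return (XElement)XNode.ReadFrom(reader);
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
        }
    }

Each streamed element can then be tested and appended to whichever output document should receive it:

    foreach (XElement el in LargeXmlStream.StreamElements(@"C:\Some\Path.xml", "NodesOfInterest"))
    {
        // route el to the appropriate output file(s)
    }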


Try an XQuery processor that implements document projection (an idea first published by Marian and Siméon). It is implemented in a number of processors, including Saxon-EE. Essentially, when you run a query such as //NodesOfInterest, the processor filters the stream of input parsing events and builds a tree that contains only the information needed to evaluate that query; it then executes the query in the usual way, but against a much smaller tree. If the query touches only a small part of the overall document, this can easily cut the memory requirement by 95% or so.
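
As a concrete illustration (my own sketch, not from the answer), the extraction query itself can be a one-liner; the file name and element name are carried over from the question:

    (: extract.xq — with document projection enabled, the processor builds an
       in-memory tree holding little more than the nodes this path expression
       can reach. In Saxon-EE the switch is, if memory serves, something like:
       java -cp saxon-ee.jar net.sf.saxon.Query -projection:on -q:extract.xq -o:out.xml :)
    <Extract>{
      doc("Path.xml")//NodesOfInterest
    }</Extract>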


Source: https://habr.com/ru/post/1398921/

