Cutting a large XML file into small pieces

I have a large Wikipedia dump that I want to cut into separate files (one file per article). I wrote a VB application to do this, but it was rather slow and crashed after hours of cutting. I'm currently splitting the file into smaller 50 MB chunks using another application, but that takes a long time (20-30 minutes per chunk). Even then, I would still have to cut each chunk into individual articles afterwards.

Does anyone have any suggestions for splitting this file faster?

+3
4 answers

The easiest way to do this in C# is with XmlReader. You can either use XmlReader on its own for the fastest implementation, or combine it with the newer LINQ to XML XNode classes for a decent balance of performance and ease of use. See this MSDN article for an example: http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx .

You should be able to modify the example so that it holds only the node for one article in memory at a time and then writes that node out as its own file. Because the rest of the dump is never loaded, this should perform well even on very large files.

+3

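Don't use DOM for this; use SAX. DOM loads the whole document into memory, while SAX processes it as a stream. There should be a SAX-style parser available for C#, and for VB as well.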
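To illustrate the streaming style this answer recommends, here is a minimal Java sketch using the JDK's built-in SAX parser. It only counts <article> elements rather than splitting them. The class name SaxArticleCounter and the method countArticles are made up for this illustration, and the <article> element name is taken from the other answer, not checked against a real dump.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class, not part of any library.
class SaxArticleCounter {

    // Streams through the XML and counts <article> start tags.
    // No document tree is built, so memory use does not grow with file size.
    static int countArticles(InputSource source) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // The default factory is not namespace-aware, so compare qName.
                if ("article".equals(qName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory stand-in for the real dump.
        String xml = "<dump><article>a</article><article>b</article></dump>";
        System.out.println(countArticles(new InputSource(new StringReader(xml))));
    }
}
```

For a real dump, the InputSource would wrap a file stream instead of a string; the handler itself would not need to change.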

0

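In Java, you could use javax.xml.stream.XMLEventReader and javax.xml.stream.XMLEventWriter.

The basic idea: read events from the Wikipedia dump, and each time you see an <article> start tag, call openNewWriter() (a helper you write yourself) to get a fresh XMLEventWriter for that article's file, then copy events into it until the matching </article>.

Something like this: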

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.events.XMLEvent;

// openDumpReader() and openNewWriter() are hypothetical helpers you supply:
// the first returns an XMLEventReader for the original Wikipedia dump,
// the second returns an XMLEventWriter for the next output file.
XMLEventReader r = openDumpReader();

XMLEventWriter w = null;

boolean isInsideArticle = false;

while (r.hasNext()) {
  XMLEvent e = r.nextEvent();

  if (e.isStartElement() &&
        e.asStartElement().getName().getLocalPart().equals("article")) {
     w = openNewWriter();
     // write the stuff that belongs outside the <article> tag
     // by synthesizing XMLEvents and using w.add() to add them
     w.add(e);
     isInsideArticle = true;
  } else if (e.isEndElement() &&
           e.asEndElement().getName().getLocalPart().equals("article")) {
     w.add(e);
     // write the stuff that belongs outside the <article> tag
     // by synthesizing XMLEvents and using w.add() to add them
     isInsideArticle = false;
     w.close();
  } else if (isInsideArticle) {
     w.add(e);
  } else {
     // this event gets dropped on the floor because it is not inside any article
  }
}
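The loop above leaves openNewWriter() and the reader setup undefined. As a rough, self-contained sketch of how those gaps might be filled, the version below creates one file per article in a target directory. The class name ArticleSplitSketch, the split method and the article-N.xml file naming are assumptions made for this example, and it assumes <article> elements are not nested.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical illustration class, not part of any library.
class ArticleSplitSketch {

    // Writes each <article> element from `in` to its own file under outDir
    // and returns how many files were written.
    static int split(InputStream in, Path outDir)
            throws IOException, XMLStreamException {
        XMLEventReader r = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        XMLEventWriter w = null;   // non-null only while inside an <article>
        OutputStream out = null;
        int count = 0;

        while (r.hasNext()) {
            XMLEvent e = r.nextEvent();
            if (e.isStartElement()
                    && e.asStartElement().getName().getLocalPart().equals("article")) {
                count++;
                // Assumed naming scheme: article-1.xml, article-2.xml, ...
                out = Files.newOutputStream(outDir.resolve("article-" + count + ".xml"));
                w = outputFactory.createXMLEventWriter(out, "UTF-8");
                w.add(e);
            } else if (e.isEndElement()
                    && e.asEndElement().getName().getLocalPart().equals("article")) {
                w.add(e);
                w.flush();
                w.close();
                out.close();   // close the file stream explicitly as well
                w = null;
            } else if (w != null) {
                w.add(e);      // copy everything inside the current article
            }
            // Events outside any <article> are skipped.
        }
        r.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory stand-in for the real dump.
        String xml = "<dump><article>first</article><article>second</article></dump>";
        Path dir = Files.createTempDirectory("articles");
        int n = split(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), dir);
        System.out.println(n + " files written to " + dir);
    }
}
```

For the real dump you would pass a buffered file stream instead of the in-memory bytes. Only one article's events pass through the writer at a time, so memory use should stay roughly constant regardless of the dump size.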

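I don't know the XML APIs in .NET well. However, .NET does have System.Xml.XmlReader and System.Xml.XmlWriter, which should let you do something similar to the Java approach.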


0

Source: https://habr.com/ru/post/1788762/
