How to efficiently read a large XML file consisting of a large number of small elements in Java?

I have a large XML file consisting of elements with a relatively fixed size i.e.

<rootElem>
  <item>...</item>
  <item>...</item>
  <item>...</item>
</rootElem>

Item elements are relatively shallow and usually quite small (<100 KB), but there can be many (hundreds of thousands) of them. Elements are completely independent of each other.

How can I process such a file efficiently in Java? I cannot read the whole file into a DOM, and I would rather not use SAX because the code becomes quite complicated. I would also like to avoid splitting the file into smaller parts.

Ideally, I would get each item element, one at a time, as a separate DOM document that I could then handle with tools such as JAXB. Basically, I just want to loop once over all the elements.

I would think that this is a fairly common problem.

+3
4 answers

Java 6 has StAX support. It does stream processing, like SAX, but uses a pull-based approach that simplifies the processing code.
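
As a rough illustration of the pull-based approach, here is a minimal sketch that walks the file with an XMLStreamReader and copies each item element into its own small DOM document using an identity Transformer with a StAXSource. The class name ItemStreamer, the ItemHandler callback, and the name child element in the sample data are all hypothetical, made up for this example; the lambda in main assumes Java 8 or later.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;

public class ItemStreamer {

    /** Hypothetical callback: receives one small DOM document per item element. */
    public interface ItemHandler {
        void handle(Document item) throws Exception;
    }

    /** Streams over the input and hands each item element to the handler; returns the item count. */
    public static int forEachItem(Reader xml, ItemHandler handler) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        // A Transformer created without a stylesheet performs an identity copy.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    // Copies the current element subtree into a fresh DOM document.
                    // The transform itself advances the reader, so we do not call
                    // next() here; the loop simply re-checks the current event.
                    DOMResult result = new DOMResult();
                    identity.transform(new StAXSource(reader), result);
                    handler.handle((Document) result.getNode());
                    count++;
                } else {
                    reader.next();
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Sample data only; a real program would pass a FileReader or similar.
        String xml = "<rootElem><item><name>a</name></item>"
                + "<item><name>b</name></item></rootElem>";
        int n = forEachItem(new StringReader(xml),
                item -> System.out.println(item.getDocumentElement().getTextContent()));
        System.out.println(n + " items");
    }
}
```

Only one item is held as a DOM at a time, so memory use should depend on the size of a single item rather than on the size of the file. If the items map to JAXB classes, the DOM step could probably be skipped, since a JAXB Unmarshaller can also unmarshal directly from an XMLStreamReader positioned on the element.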

+3

, ( a.k.a.) , , . , SAX (, , ), , . XML- , , , , , .

, SAX , / DOM. ( - , /, , .)

+1

, ... XML , BufferedReader, <item> StringBuffer. ( ) DOM . DocumentBuilder.

, - DOM. , XML: XML , ( <item/>?), .

, XML (, ), XML . , SAX , DOM-, .

, SAX StAX DOM- . , .

0

Using the DOM, I have an efficient way to parse XML. I wrote this recursive DOM parser myself; it parses your XML without needing to know a single tag name in advance. It outputs the text content of each node that has any, in document order. You can uncomment the commented-out lines in the following code to also get the node name. Hope this helps.

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RecDOMP {

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);
        DocumentBuilder db = dbf.newDocumentBuilder();

        // Replace the following path with your input XML path.
        // Note: db.parse loads the whole document into memory as a DOM tree.
        Document doc;
        try (InputStream in = new FileInputStream(new File("D:\\ambuj\\input.xml"))) {
            doc = db.parse(in);
        }

        // Replace the following path with your output text file path.
        // FileOutputStream creates the file if it does not exist yet.
        File outputDom = new File("D:\\ambuj\\outapip1.txt");
        try (BufferedWriter bwriter = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(outputDom)))) {
            visitRecursively(doc, bwriter);
        }

        System.out.println("Done");
    }

    // Walks the tree depth-first and writes every non-blank text node on its own line.
    public static void visitRecursively(Node node, BufferedWriter bw) throws IOException {
        // Get all child nodes.
        NodeList list = node.getChildNodes();
        for (int i = 0; i < list.getLength(); i++) {
            Node childNode = list.item(i);
            if (childNode.getNodeType() == Node.TEXT_NODE) {
                // Uncomment to also print the node name and type:
                // System.out.println("Found Node: " + childNode.getNodeName()
                //         + " - with value: " + childNode.getNodeValue()
                //         + " Node type:" + childNode.getNodeType());

                // Removes all whitespace, including spaces inside the text.
                String nodeValue = childNode.getNodeValue().replaceAll("\\s", "");
                if (!nodeValue.isEmpty()) {
                    System.out.println(nodeValue);
                    bw.write(nodeValue);
                    bw.newLine();
                }
            }
            visitRecursively(childNode, bw);
        }
    }
}
0

Source: https://habr.com/ru/post/1763499/

