Optimizing the speed of parsing an XML file using VTD-XML

I am parsing a large number of XML files using VTD-XML. I am not sure whether I am using the tool correctly. I think so, but parsing the files takes too long.

The XML files (in DATEX II format) are stored as zipped files on the hard disk. Unzipped, each one is about 31 MB and contains just over 850,000 lines of text. I only need to extract a few fields and store them in a database.

    import org.apache.commons.lang3.math.NumberUtils;
    ...
    private static void test(File zipFile) throws XPathEvalException, NavException, XPathParseException {
        // init timer
        long step1 = System.currentTimeMillis();

        // parse the XML entry inside the zip file
        VTDGen vg = new VTDGen();
        vg.parseZIPFile(zipFile.getAbsolutePath(), zipFile.getName().replace(".zip", ".xml"), true);
        VTDNav vn = vg.getNav();

        AutoPilot apSites = new AutoPilot();
        apSites.declareXPathNameSpace("ns1", "http://schemas.xmlsoap.org/soap/envelope/");
        apSites.selectXPath("/ns1:Envelope/ns1:Body/d2LogicalModel/payloadPublication/siteMeasurements");
        apSites.bind(vn);

        long step2 = System.currentTimeMillis();
        System.out.println("Prep took " + (step2 - step1) + "ms; ");

        // init variables
        String siteID, timeStr;
        boolean reliable;
        int index, flow, ctr = 0;
        short speed;

        while (apSites.evalXPath() != -1) {
            vn.toElement(VTDNav.FIRST_CHILD, "measurementSiteReference");
            siteID = vn.toString(vn.getText());

            // loop all measured values of this measurement site
            while (vn.toElement(VTDNav.NEXT_SIBLING, "measuredValue")) {
                ctr++;

                // extract index attribute
                index = NumberUtils.toInt(vn.toString(vn.getAttrVal("index")));

                // go one level deeper into basicDataValue
                vn.toElement(VTDNav.FIRST_CHILD, "basicDataValue");

                // we need either FIRST_CHILD or NEXT_SIBLING depending on whether we find something
                int next = VTDNav.FIRST_CHILD;
                if (vn.toElement(next, "time")) {
                    timeStr = vn.toString(vn.getText());
                    next = VTDNav.NEXT_SIBLING;
                }
                if (vn.toElement(next, "averageVehicleSpeed")) {
                    speed = NumberUtils.toShort(vn.toString(vn.getText()));
                    next = VTDNav.NEXT_SIBLING;
                }
                if (vn.toElement(next, "vehicleFlow")) {
                    flow = NumberUtils.toInt(vn.toString(vn.getText()));
                    next = VTDNav.NEXT_SIBLING;
                }
                if (vn.toElement(next, "fault")) {
                    reliable = vn.toString(vn.getText()).equals("0");
                }

                // insert into database here...

                // step back up: once if no child was found, twice if one was
                if (next == VTDNav.NEXT_SIBLING) {
                    vn.toElement(VTDNav.PARENT);
                }
                vn.toElement(VTDNav.PARENT);
            }
        }

        System.out.println("Loop took " + (System.currentTimeMillis() - step2) + "ms; ");
        System.out.println("Total number of measured values: " + ctr);
    }

The exact output of the function above for one of my XML files is:

    Prep took 25756ms; 
    Loop took 26889ms; 
    Total number of measured values: 112611

At the moment the data is not even inserted into the database. The problem is that I receive one of these files every minute. Total parsing time is almost 1 minute, and since downloading a file takes about 10 seconds and I still need to store the data in a database, I am already running behind real time.
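To make the time budget concrete, the figures quoted above can be added up in a few lines. This is only a sketch of the arithmetic: the class and method names are made up for illustration, and the 10-second download is the rounded estimate from the post. With these inputs the total comes to 62,645 ms per file against a 60,000 ms budget, and the parsing throughput is roughly 0.59 MB/s.

```java
import java.util.Locale;

final class TimeBudget {
    /** Total wall-clock time spent on one file, in milliseconds. */
    static long totalMillis(long prepMs, long loopMs, long downloadMs) {
        return prepMs + loopMs + downloadMs;
    }

    /** Throughput in MB/s for a file of the given size. */
    static double megabytesPerSecond(double sizeMb, long elapsedMs) {
        return sizeMb / (elapsedMs / 1000.0);
    }

    public static void main(String[] args) {
        long prepMs = 25_756;      // "Prep took" from the output above
        long loopMs = 26_889;      // "Loop took" from the output above
        long downloadMs = 10_000;  // "about 10 seconds" (rounded estimate)

        long total = totalMillis(prepMs, loopMs, downloadMs);
        System.out.println("Total per file: " + total + " ms (budget: 60000 ms)");
        System.out.println(String.format(Locale.ROOT, "Parsing throughput: %.2f MB/s",
                megabytesPerSecond(31.0, prepMs + loopMs)));
    }
}
```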

Is there any way to speed this up? Things I tried that didn't help:

  • Using AutoPilots for all fields; this actually made the second step slower by 30,000 ms.
  • Unzipping the file myself and parsing the byte array with VTD; this made no difference.
  • Parsing the file myself using BufferedReader.readLine(), but that is not fast enough either.
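One way to see where the 25 seconds of "prep" go is to time the unzip step on its own, separately from the parse. The sketch below uses only `java.util.zip`; the class name `UnzipTimer` and its `readEntry` helper are hypothetical names for illustration, not part of VTD-XML. The resulting byte array could then be handed to the parser and timed as a second step.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class UnzipTimer {
    /** Reads one entry of a zip archive fully into memory. */
    static byte[] readEntry(String zipPath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No entry " + entryName + " in " + zipPath);
            }
            try (InputStream in = zip.getInputStream(entry);
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buffer = new byte[64 * 1024];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: UnzipTimer <zip file> <entry name>");
            return;
        }
        long start = System.nanoTime();
        byte[] xml = readEntry(args[0], args[1]);
        long unzipMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Unzipped " + xml.length + " bytes in " + unzipMs + " ms");
        // Next step (not shown): pass the bytes to the parser, for example
        // VTDGen.setDoc(xml) followed by VTDGen.parse(true), and time that separately.
    }
}
```

If the unzip alone turns out to be fast, the time is being spent in parsing or elsewhere (disk, JVM warm-up, memory pressure), which narrows down what to look at next.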

Does anyone see a way to speed this up, or do I need to start thinking about a heavier machine or multithreading? Granted, 850,000 lines per minute (about 1.2 billion lines per day) is a lot, but I still feel that parsing 31 MB of data should not take a minute ...
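For the multithreading option, a minimal sketch with a fixed thread pool might look like the following. Everything here is an assumption for illustration: `ParallelParsing`, `processAll` and the `parseOne` callback are hypothetical names, and `parseOne` stands in for a per-file routine such as `test(File)` above that returns the number of measured values. Each task would need its own `VTDGen` and `VTDNav`, as is already the case when they are local variables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

final class ParallelParsing {
    /**
     * Runs parseOne for every path on a fixed-size pool and returns the sum
     * of the per-file results (for example, counts of measured values).
     */
    static int processAll(List<String> paths, int threads, ToIntFunction<String> parseOne)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String path : paths) {
                Callable<Integer> task = () -> parseOne.applyAsInt(path);
                pending.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : pending) {
                total += result.get(); // blocks until that file is done
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that this only improves throughput: several files are handled at once, but each individual file still takes as long as before. It would help keep up with one file per minute, yet it does not explain why a single 31 MB file takes close to a minute.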

1 answer

You can unzip the folder up front and collect a reference to each XML file in an array using

    File[] files = new File("foldername").listFiles();

and then write a loop that goes through each file. I'm not sure whether this will speed things up, but it's worth a try.
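The loop described here could be sketched as follows. The class name `FolderLoop` and the `.xml` filter are additions for illustration, and the actual parsing call is left as a placeholder. One detail worth handling is that `listFiles` returns `null` when the path is not a directory.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class FolderLoop {
    /** Returns the .xml files directly inside the folder (empty if there are none). */
    static List<File> xmlFiles(File folder) {
        File[] files = folder.listFiles((dir, name) -> name.toLowerCase().endsWith(".xml"));
        List<File> result = new ArrayList<>();
        if (files != null) { // null means the folder is missing or not a directory
            for (File f : files) {
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        File folder = new File(args.length > 0 ? args[0] : "foldername");
        for (File xml : xmlFiles(folder)) {
            // Placeholder: call the per-file parsing routine here.
            System.out.println("Would parse " + xml.getName());
        }
    }
}
```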


Source: https://habr.com/ru/post/1396271/

