Searching for regular expression patterns in a 30 GB XML dataset using 16 GB of memory

I currently have a Java SAX parser that extracts some information from a 30 gigabyte XML file.

Currently it:

  • reads every XML node,
  • saves it to a String object,
  • runs some regexes on it,
  • stores the results in a database

for several million items. I run this on a machine with 16 GB of memory, but the memory is nowhere near fully used.
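For reference, the per-node regex step amounts to something like this minimal sketch (the class name, the pattern and the scan() method are made up for illustration, not taken from the actual code):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NodeScanner {
    // Placeholder pattern: compile once, reuse for every node.
    private static final Pattern PATTERN = Pattern.compile("\\b[A-Z]{2}\\d{6}\\b");

    // Run the regex over the text of one node and collect the matches.
    static List<String> scan(String nodeText) {
        List<String> hits = new ArrayList<>();
        Matcher m = PATTERN.matcher(nodeText);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}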

Is there an easy way to dynamically “buffer” about 10 gigabytes of data from an input file?

I suspect I could hand-roll a multi-threaded producer/consumer version (loading objects on one side, using and discarding them on the other), but come on, XML has been around forever; surely there are efficient libraries to crunch it?
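A hand-rolled producer/consumer version could look roughly like the sketch below, built on a bounded BlockingQueue so the queue itself acts as the buffer (all names and sizes here are illustrative assumptions, not part of the original code):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// One thread feeds node text into a bounded queue; a worker thread takes items,
// runs the regexes and writes to the database. The bounded queue is the "buffer":
// the producer blocks whenever the consumers fall behind.
public class Pipeline {
    private static final String POISON = new String("EOF"); // sentinel, compared by reference

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        Thread producer = new Thread(() -> {
            // In the real program the SAX handler would push each node's text here.
            try {
                queue.put("<example node text>");
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (String text = queue.take(); text != POISON; text = queue.take()) {
                    // run regexes on text, batch the results, write to the db
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}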

+3
10 answers

First, try to find out what is actually slowing you down.

  • How much faster is the parser when you parse from memory (a copy of the data already in RAM)?
  • Does it use a BufferedInputStream with a large buffer?

Is it easy to split the XML file into chunks? In general, churning through 30 GB of any data takes a while, since you first have to read it off the disk, so you are always limited by that I/O speed. Can you distribute the load across several machines, perhaps by using something like Hadoop?
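A quick way to try the buffered-input point, as a minimal sketch (the 8 MB buffer size and the empty DefaultHandler are arbitrary placeholders):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Feed the SAX parser through a BufferedInputStream with a generous buffer
// so the parser is not making lots of tiny reads from the 30 GB file.
public class BufferedParse {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try (InputStream in = new BufferedInputStream(
                new FileInputStream(args[0]), 8 * 1024 * 1024)) {
            parser.parse(in, new DefaultHandler()); // replace with your handler
        }
    }
}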

+2
+4

You are running this in Java, so how much memory are you actually giving the JVM (the -Xmx heap setting)? SAX itself should need very little memory, though...

+2

SAX is, in essence, a "streaming" model: it hands you events one at a time and never holds the whole document, so the parser itself should not be your memory or speed problem. Are you perhaps holding on to data longer than you need to? Once you have "finished" a node (run the regexes and stored the result), drop the reference so it can be garbage-collected.
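As an illustration of discarding as you go, here is a minimal handler sketch that keeps only the current element's text and resets it once the element has been processed (the element name "record" and the process() stub are assumptions):

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Only the current element's text is kept; the buffer is processed and
// cleared in endElement, so nothing accumulates across nodes.
public class DiscardingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start fresh for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("record".equals(qName)) {
            process(text.toString()); // run regexes, store result, then forget it
        }
        text.setLength(0);
    }

    private void process(String nodeText) {
        // placeholder: regex matching + database write would go here
    }
}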

+2

If you can split the XML into independent chunks, you could

  • split the XML into several files, and
  • parse the chunks in parallel (each with its own SAX parser in its own thread; a rough sketch follows below)

Caveat: splitting XML correctly is the tricky part. How is the file structured; is it essentially a flat list of records?
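Assuming the file has already been split into well-formed chunk files, parallel parsing could be sketched like this (the file names, pool size and empty handler are placeholders):

import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Each chunk gets its own SAX parser on its own worker thread.
public class ParallelParse {
    public static void main(String[] args) throws Exception {
        List<File> chunks = List.of(new File("chunk1.xml"), new File("chunk2.xml"));
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (File chunk : chunks) {
            pool.submit(() -> {
                try {
                    SAXParserFactory.newInstance().newSAXParser()
                            .parse(chunk, new DefaultHandler()); // your handler here
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}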

+1

How are you writing to the db? If you commit every single item separately, the database round trips, not the parsing, may be what is slow; try batching the inserts (a sketch follows below).

In other words, measure where the time actually goes before reworking the parser.
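If the inserts do turn out to be the bottleneck, batching them over JDBC is one option; a minimal sketch, with a made-up table, column and connection URL:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Accumulate rows with addBatch() and send them to the database in groups
// instead of paying one round trip per match.
public class BatchInsert {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "pass")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps =
                    conn.prepareStatement("INSERT INTO matches(value) VALUES (?)")) {
                int count = 0;
                for (String match : new String[] {"a", "b", "c"}) { // your regex results
                    ps.setString(1, match);
                    ps.addBatch();
                    if (++count % 1000 == 0) {
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch(); // flush the remainder
                conn.commit();
            }
        }
    }
}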

+1

Jibx "" XML , . ArrayList, , x , (, , ), , "" .

Jibx is hosted on SourceForge: Jibx

: XML "" String. , , . ArrayList.

(roughly like this, overriding add()):

@Override
public boolean add(Object o) {
    boolean added = super.add(o);
    // once the list grows past the threshold, process and empty it
    if (size() > YOUR_DEFINED_THRESHOLD) {
        flushObjects();
    }
    return added;
}

YOUR_DEFINED_THRESHOLD is the number of items you want to keep in the arraylist at any one time. flushObjects() would run your regexes on the buffered items, write the results to the database, and then clear the list. That way you never hold more than a bounded slice of the XML in memory, no matter how large the file is.
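Filled out as a small self-contained sketch (the threshold value and the body of flushObjects() are placeholders):

import java.util.ArrayList;

// A list that processes and empties itself whenever it grows past the
// threshold, so memory use stays bounded regardless of input size.
public class FlushingList extends ArrayList<String> {
    private static final int YOUR_DEFINED_THRESHOLD = 10_000;

    @Override
    public boolean add(String item) {
        boolean added = super.add(item);
        if (size() > YOUR_DEFINED_THRESHOLD) {
            flushObjects();
        }
        return added;
    }

    private void flushObjects() {
        for (String item : this) {
            // run regexes on item and write matches to the database
        }
        clear(); // drop the processed items so they can be garbage-collected
    }
}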

+1

You could load the XML into an XML database (eXist, for example) and run your searches as queries against that instead of re-parsing the 30 GB file every time.

0

Also, consider StAX instead of SAX: it is a pull parser (you ask for the next event when you are ready for it), which often makes this kind of streaming extraction easier to structure.
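A minimal StAX sketch of that pull style (the element name "record" is an assumption):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// The loop asks for events one at a time; only the current element's text
// is ever in memory.
public class StaxScan {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream fis = new FileInputStream(args[0])) {
            XMLStreamReader reader = factory.createXMLStreamReader(new BufferedInputStream(fis));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    String text = reader.getElementText(); // consumes up to the end tag
                    // run regexes on text, store results
                }
            }
            reader.close();
        }
    }
}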

0

Is the XML compressed, or could you store it compressed? If the job is I/O-bound, which it almost certainly is, reading a compressed file and decompressing on the fly means far fewer bytes come off the disk.
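For example, a gzip-compressed copy can be parsed while decompressing on the fly; a sketch, with a made-up file name and an empty placeholder handler:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Decompression happens in the stream, so the parser sees plain XML
// while far fewer bytes are read from disk.
public class GzipParse {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try (InputStream in = new GZIPInputStream(
                new BufferedInputStream(new FileInputStream("data.xml.gz")), 64 * 1024)) {
            parser.parse(in, new DefaultHandler()); // your handler here
        }
    }
}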

0

Source: https://habr.com/ru/post/1697300/

