Searching for regular expression patterns in a 30 GB XML dataset using 16 GB of memory

I currently have a Java SAX parser that extracts some information from a 30 gigabyte XML file.

Currently it:

  • reads every XML node,
  • saves it to a String object,
  • runs some regexes on it,
  • stores the results in a database

for several million items. I run this on a machine with 16 GB of memory, but the memory is nowhere near fully used.
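For reference, the per-node regex step amounts to something like this minimal sketch (the class name, the pattern and the scan() method are made up for illustration, not taken from the actual code):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NodeScanner {
    // Placeholder pattern: compile once, reuse for every node.
    private static final Pattern PATTERN = Pattern.compile("\\b[A-Z]{2}\\d{6}\\b");

    // Run the regex over the text of one node and collect the matches.
    static List<String> scan(String nodeText) {
        List<String> hits = new ArrayList<>();
        Matcher m = PATTERN.matcher(nodeText);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}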

Is there an easy way to dynamically “buffer” about 10 gigabytes of data from an input file?

I suspect I could hand-roll a multi-threaded producer/consumer version (loading objects on one side, using and discarding them on the other), but come on, XML has been around forever; surely there are efficient libraries to crunch it?
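A hand-rolled producer/consumer version could look roughly like the sketch below, built on a bounded BlockingQueue so the queue itself acts as the buffer (all names and sizes here are illustrative assumptions, not part of the original code):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// One thread feeds node text into a bounded queue; a worker thread takes items,
// runs the regexes and writes to the database. The bounded queue is the "buffer":
// the producer blocks whenever the consumers fall behind.
public class Pipeline {
    private static final String POISON = new String("EOF"); // sentinel, compared by reference

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        Thread producer = new Thread(() -> {
            // In the real program the SAX handler would push each node's text here.
            try {
                queue.put("<example node text>");
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (String text = queue.take(); text != POISON; text = queue.take()) {
                    // run regexes on text, batch the results, write to the db
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}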

+3
10 answers

First, try to find out what is actually slowing you down.

  • How much faster is the parser when you parse from memory (a copy of the data already in RAM)?
  • Does it use a BufferedInputStream with a large buffer?

Is it easy to split the XML file into chunks? In general, churning through 30 GB of any data takes a while, since you first have to read it off the disk, so you are always limited by that I/O speed. Can you distribute the load across several machines, perhaps by using something like Hadoop?
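A quick way to try the buffered-input point, as a minimal sketch (the 8 MB buffer size and the empty DefaultHandler are arbitrary placeholders):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Feed the SAX parser through a BufferedInputStream with a generous buffer
// so the parser is not making lots of tiny reads from the 30 GB file.
public class BufferedParse {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try (InputStream in = new BufferedInputStream(
                new FileInputStream(args[0]), 8 * 1024 * 1024)) {
            parser.parse(in, new DefaultHandler()); // replace with your handler
        }
    }
}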

+2
+4

You are running this in Java, so how much memory are you actually giving the JVM (the -Xmx heap setting)? SAX itself should need very little memory, though...

+2

SAX is, in essence, a "streaming" model: it hands you events one at a time and never holds the whole document, so the parser itself should not be your memory or speed problem. Are you perhaps holding on to data longer than you need to? Once you have "finished" a node (run the regexes and stored the result), drop the reference so it can be garbage-collected.
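As an illustration of discarding as you go, here is a minimal handler sketch that keeps only the current element's text and resets it once the element has been processed (the element name "record" and the process() stub are assumptions):

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Only the current element's text is kept; the buffer is processed and
// cleared in endElement, so nothing accumulates across nodes.
public class DiscardingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start fresh for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("record".equals(qName)) {
            process(text.toString()); // run regexes, store result, then forget it
        }
        text.setLength(0);
    }

    private void process(String nodeText) {
        // placeholder: regex matching + database write would go here
    }
}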

+2

If you can split the XML into independent chunks, you could

  • split the XML into several files, and
  • parse the chunks in parallel (each with its own SAX parser in its own thread; a rough sketch follows below)

Caveat: splitting XML correctly is the tricky part. How is the file structured; is it essentially a flat list of records?
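Assuming the file has already been split into well-formed chunk files, parallel parsing could be sketched like this (the file names, pool size and empty handler are placeholders):

import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Each chunk gets its own SAX parser on its own worker thread.
public class ParallelParse {
    public static void main(String[] args) throws Exception {
        List<File> chunks = List.of(new File("chunk1.xml"), new File("chunk2.xml"));
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (File chunk : chunks) {
            pool.submit(() -> {
                try {
                    SAXParserFactory.newInstance().newSAXParser()
                            .parse(chunk, new DefaultHandler()); // your handler here
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}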

+1

How are you writing to the db? If you commit every single item separately, the database round trips, not the parsing, may be what is slow; try batching the inserts (a sketch follows below).

In other words, measure where the time actually goes before reworking the parser.
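If the inserts do turn out to be the bottleneck, batching them over JDBC is one option; a minimal sketch, with a made-up table, column and connection URL:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Accumulate rows with addBatch() and send them to the database in groups
// instead of paying one round trip per match.
public class BatchInsert {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "pass")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps =
                    conn.prepareStatement("INSERT INTO matches(value) VALUES (?)")) {
                int count = 0;
                for (String match : new String[] {"a", "b", "c"}) { // your regex results
                    ps.setString(1, match);
                    ps.addBatch();
                    if (++count % 1000 == 0) {
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch(); // flush the remainder
                conn.commit();
            }
        }
    }
}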

+1

Jibx "" XML , . ArrayList, , x , (, , ), , "" .

Jibx is hosted on SourceForge: Jibx

: XML "" String. , , . ArrayList.

(roughly like this, overriding add()):

@Override
public boolean add(Object o) {
    boolean added = super.add(o);
    // once the list grows past the threshold, process and empty it
    if (size() > YOUR_DEFINED_THRESHOLD) {
        flushObjects();
    }
    return added;
}

YOUR_DEFINED_THRESHOLD is the number of items you want to keep in the arraylist at any one time. flushObjects() would run your regexes on the buffered items, write the results to the database, and then clear the list. That way you never hold more than a bounded slice of the XML in memory, no matter how large the file is.
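Filled out as a small self-contained sketch (the threshold value and the body of flushObjects() are placeholders):

import java.util.ArrayList;

// A list that processes and empties itself whenever it grows past the
// threshold, so memory use stays bounded regardless of input size.
public class FlushingList extends ArrayList<String> {
    private static final int YOUR_DEFINED_THRESHOLD = 10_000;

    @Override
    public boolean add(String item) {
        boolean added = super.add(item);
        if (size() > YOUR_DEFINED_THRESHOLD) {
            flushObjects();
        }
        return added;
    }

    private void flushObjects() {
        for (String item : this) {
            // run regexes on item and write matches to the database
        }
        clear(); // drop the processed items so they can be garbage-collected
    }
}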

+1

You could load the XML into an XML database (eXist, for example) and run your searches as queries against that instead of re-parsing the 30 GB file every time.

0

Also, consider StAX instead of SAX: it is a pull parser (you ask for the next event when you are ready for it), which often makes this kind of streaming extraction easier to structure.
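A minimal StAX sketch of that pull style (the element name "record" is an assumption):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// The loop asks for events one at a time; only the current element's text
// is ever in memory.
public class StaxScan {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream fis = new FileInputStream(args[0])) {
            XMLStreamReader reader = factory.createXMLStreamReader(new BufferedInputStream(fis));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    String text = reader.getElementText(); // consumes up to the end tag
                    // run regexes on text, store results
                }
            }
            reader.close();
        }
    }
}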

0

Is the XML compressed, or could you store it compressed? If the job is I/O-bound, which it almost certainly is, reading a compressed file and decompressing on the fly means far fewer bytes come off the disk.
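For example, a gzip-compressed copy can be parsed while decompressing on the fly; a sketch, with a made-up file name and an empty placeholder handler:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Decompression happens in the stream, so the parser sees plain XML
// while far fewer bytes are read from disk.
public class GzipParse {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try (InputStream in = new GZIPInputStream(
                new BufferedInputStream(new FileInputStream("data.xml.gz")), 64 * 1024)) {
            parser.parse(in, new DefaultHandler()); // your handler here
        }
    }
}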

0

Source: https://habr.com/ru/post/1697300/

