Parsing a very large (1.4 GB) XML file with Ruby on Rails. Is there a better way than SAXParser?

I am currently using LibXML::SAXParser::Callbacks to parse a large XML file containing data for 140,000 products. I am using a rake task to import the data for these products into my Rails application.
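
For reference, the callback-based setup looks roughly like the sketch below. This is not the actual rake task: the `Product` element name, the `ProductImporter` class, and the buffering logic are assumptions made for illustration.

    # Minimal sketch of a LibXML SAX import. The 'Product' element name and the
    # commented-out model call are assumptions, not the poster's real code.
    require 'libxml'

    class ProductImporter
      include LibXML::XML::SaxParser::Callbacks

      def on_start_element(element, attributes)
        # Start buffering when a product element begins.
        @current = { attrs: attributes, text: +'' } if element == 'Product'
      end

      def on_characters(chars)
        @current[:text] << chars if @current
      end

      def on_end_element(element)
        return unless element == 'Product' && @current
        # Product.create!(@current[:attrs])  # hypothetical persistence step
        @current = nil
      end
    end

    parser = LibXML::XML::SaxParser.file('products.xml')
    parser.callbacks = ProductImporter.new
    parser.parse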

My last import took a little less than 10 hours:

rake asi:import_products --trace 26815.23s user 1393.03s system 80% cpu 9:47:34.09 total 

The problem with the current implementation is that the complex dependency structure of the XML means I need to track the state of the entire product node in order to know how to parse it correctly.

Ideally, I would like to be able to process each product node on its own and use XPath against it, but the file size rules out any method that requires loading the entire XML file into memory. I cannot control the format or size of the source XML, and the process has at most 3 GB of memory to work with.
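
As an aside (this is not something mentioned in the thread), a streaming pull parser such as Nokogiri::XML::Reader is one way to get exactly that: it walks the file node by node, so only one product subtree is in memory at a time, and each subtree can be turned into a small document that XPath can run against. Whether it would beat the existing SAX callbacks here is untested; the `Product`, `Sku`, and `Name` element names are assumptions.

    # Sketch of a streaming pull-parser approach (Nokogiri, not from the thread).
    # Only one <Product> subtree is materialised at a time; element names are assumed.
    require 'nokogiri'

    reader = Nokogiri::XML::Reader(File.open('products.xml'))
    reader.each do |node|
      next unless node.name == 'Product' &&
                  node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

      product = Nokogiri::XML(node.outer_xml)   # small per-product document
      sku  = product.at_xpath('//Sku')&.text
      name = product.at_xpath('//Name')&.text
      # Product.create!(sku: sku, name: name)   # hypothetical persistence step
    end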

Is there a better way than this?

Current rake code:

XML file fragment:

1 answer

Are you able to download the whole file first? If so, I would suggest splitting the XML file into smaller chunks (say, 512 MB or so) so that you can parse several chunks concurrently (one per core), since I believe you have a modern multi-core processor. As for the invalid or malformed XML that splitting produces: just prepend or append the missing tags with simple string handling.
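
A rough sketch of that idea, assuming roughly line-oriented XML with a `Products` root and one `Product` element per record; the chunk size, file names, and `parse_chunk` (the existing SAX import pointed at a single chunk file) are all illustrative, not part of the original answer:

    # Split the big file into self-contained chunks, then fork one worker per chunk.
    # <Products>/<Product> names, the chunk size, and parse_chunk are assumptions.
    PRODUCTS_PER_CHUNK = 10_000

    def split_into_chunks(path)
      chunks, buffer, count, index = [], +'', 0, 0
      write_chunk = lambda do
        file = "chunk_#{index}.xml"
        # Re-wrap the fragment so each chunk is well-formed XML on its own.
        File.write(file, "<Products>\n#{buffer}</Products>\n")
        chunks << file
        buffer, count, index = +'', 0, index + 1
      end

      File.foreach(path) do |line|
        next if line =~ %r{\A\s*(<\?xml|</?Products)}  # skip declaration and original root tags
        buffer << line
        count += 1 if line.include?('</Product>')
        write_chunk.call if count == PRODUCTS_PER_CHUNK
      end
      write_chunk.call unless buffer.empty?
      chunks
    end

    # One worker process per chunk, roughly one per core.
    pids = split_into_chunks('products.xml').map do |chunk|
      fork { parse_chunk(chunk) }
    end
    pids.each { |pid| Process.wait(pid) }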

You could also try profiling your callback methods. It's a big chunk of code, and there is bound to be at least one bottleneck in there whose removal could save you a few minutes.
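
For example, ruby-prof (one possible profiler, not mentioned in the thread) can wrap the whole parse and show where the callback time actually goes:

    # Profile the import with ruby-prof (gem install ruby-prof).
    require 'ruby-prof'
    require 'libxml'

    RubyProf.start
    parser = LibXML::XML::SaxParser.file('products.xml')
    parser.callbacks = ProductImporter.new   # the existing callbacks object
    parser.parse
    result = RubyProf.stop

    # Flat report, slowest methods first.
    RubyProf::FlatPrinter.new(result).print($stdout)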


Source: https://habr.com/ru/post/1310152/
