I am currently using LibXML::XML::SaxParser::Callbacks to parse a large XML file containing data for 140,000 products. The import runs as a rake task that loads these products into my Rails application.
My last import took a little less than 10 hours:
rake asi:import_products --trace 26815.23s user 1393.03s system 80% cpu 9:47:34.09 total
The problem with the current implementation is that the XML has a complex dependency structure, so I have to keep track of the entire product node in order to know how to parse it correctly.
Ideally, I would like to process each product node on its own and use XPath on it, but the file size rules out any method that requires loading the entire XML file into memory. I cannot control the format or size of the source XML, and the process can use at most 3 GB of memory.
Is there a better way than this?
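To make the goal concrete, here is a minimal sketch of the kind of per-node processing I am after: stream the file, build a small in-memory subtree for one product at a time, run XPath against just that subtree, then discard it. It uses REXML only because it ships with Ruby and keeps the example self-contained; it is likely much slower than libxml, so it illustrates the shape of the approach rather than a fix for the run time. The names `ProductListener` and `each_product`, and the element names `catalog`, `product`, `name` and `variant`, are made up for illustration and are not taken from the real feed.

```ruby
require "rexml/document"
require "rexml/streamlistener"
require "rexml/xpath"
require "stringio"

# Collects stream events into one small REXML subtree per product element
# and hands each finished subtree to a block. (Hypothetical helper class.)
class ProductListener
  include REXML::StreamListener

  def initialize(element_name, &on_product)
    @element_name = element_name
    @on_product = on_product
    @stack = [] # open elements of the product currently being built
  end

  def tag_start(name, attrs)
    # Ignore everything outside a product element.
    return if @stack.empty? && name != @element_name

    element = REXML::Element.new(name)
    element.add_attributes(attrs)
    @stack.last.add_element(element) unless @stack.empty?
    @stack.push(element)
  end

  def text(value)
    @stack.last.add_text(value) unless @stack.empty?
  end

  def tag_end(_name)
    return if @stack.empty?

    element = @stack.pop
    # An empty stack means the product element itself was just closed.
    @on_product.call(element) if @stack.empty?
  end
end

# Streams the input and yields one detached product subtree at a time.
def each_product(io, element_name: "product", &block)
  REXML::Document.parse_stream(io, ProductListener.new(element_name, &block))
end

# Made-up sample data; the real feed has a different structure.
xml = <<~XML
  <catalog>
    <product id="1"><name>Mug</name><variant color="red"/><variant color="blue"/></product>
    <product id="2"><name>Pen</name><variant color="black"/></product>
  </catalog>
XML

summaries = []
each_product(StringIO.new(xml)) do |product|
  name = REXML::XPath.first(product, "name").text
  colors = REXML::XPath.match(product, "variant").map { |v| v.attributes["color"] }
  summaries << [product.attributes["id"], name, colors]
end
```

Only one product subtree is alive at any moment, so memory use should depend on the largest single product rather than on the size of the whole file. In the real task the `StringIO` would be replaced by an open `File`.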
Current rake code:
XML file fragment: