Parsing / scanning through a 17 GB file

I am trying to parse a Stack Overflow dump file (Posts.xml, 17 GB). It has the form:

<posts>
    <row Id="15228715" PostTypeId="1" />
    <row Id="15228716" PostTypeId="2" ParentId="1600647" LastActivityDate="2013-03-05T16:13:24.897"/>
</posts>

I have to "group" each question with its answers. Basically: find a question (PostTypeId = 1), find its answers using the ParentId attribute of the other rows, and save them in a DB.

I tried doing this with QueryPath (DOM), but it kept crashing (exit code 139, i.e. a segmentation fault). I assume that, due to the large file size, my computer could not handle this, even with a huge swap.

I looked at XMLReader but, as I understand it, with XMLReader the program would have to read the file many times (find a question, look for its answers, repeat many times) and is therefore not viable. Am I wrong?

Is there any other way / method?

Help!

This is a one-time analysis.

+4
3 answers

I looked at XMLReader but, as I understand it, with XMLReader the program would have to read the file many times (find a question, look for its answers, repeat many times) and is therefore not viable. Am I wrong?

Yes, you are mistaken. With XMLReader you decide yourself how often you want to go through the file (usually you do it once). In your case, I see no reason why you could not even do a 1:1 insert of each <row> element, choosing by its attributes which database table you want to insert into.
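A minimal single-pass sketch with plain XMLReader (no extra library); the inline sample mirrors the structure from the question, the counting is only for illustration, and the actual insert calls are left as comments:

```php
<?php
// Single-pass sketch with plain XMLReader. The inline sample stands in
// for the real dump; for that you would use $reader->open('Posts.xml').
$xml = '<posts>'
     . '<row Id="15228715" PostTypeId="1"/>'
     . '<row Id="15228716" PostTypeId="2" ParentId="1600647"/>'
     . '</posts>';

$reader = new XMLReader();
$reader->XML($xml);

$questions = $answers = 0;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'row') {
        if ($reader->getAttribute('ParentId') === null) {
            $questions++;   // PostTypeId 1: insert into the questions table here
        } else {
            $answers++;     // PostTypeId 2: insert with its ParentId here
        }
    }
}
$reader->close();
```

The reader only ever holds the current node in memory, which is what makes a single pass over a 17 GB file feasible.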

I usually suggest a set of iterators that makes traversal with XMLReader easy. It is called XMLReaderIterator and allows foreach over an XMLReader, which often makes the code easier to read and write:

$reader = new XMLReader();
$reader->open($xmlFile);

/* @var $posts XMLReaderNode[] - iterate over all <posts><row> elements */
$posts = new XMLElementIterator($reader, 'row');

foreach ($posts as $post) {
    $isAnswerInsteadOfQuestion = (bool)$post->getAttribute('ParentId');
    $importer = $isAnswerInsteadOfQuestion ? $importerAnswers : $importerQuestions;
    $importer->importRowNode($post);
}

If you are worried about the order (for example, you may fear that some answers appear before their parent question), I would take care of that inside the importer layer, not inside the traversal.
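One way such an importer layer could deal with ordering (a sketch; the class and method names are invented here, not part of the XMLReaderIterator library) is to park answers whose question has not been seen yet and attach them once it arrives:

```php
<?php
// Sketch of ordering-tolerant grouping inside the importer layer
// (all names hypothetical). Answers arriving before their question
// are parked until the question shows up.
class GroupingImporter
{
    private array $questions = [];  // questionId => true (seen)
    private array $pending   = [];  // questionId => parked answer ids
    public  array $grouped   = [];  // questionId => answer ids

    public function importQuestion(string $id): void
    {
        $this->questions[$id] = true;
        // Adopt any answers that arrived before this question.
        $this->grouped[$id] = $this->pending[$id] ?? [];
        unset($this->pending[$id]);
    }

    public function importAnswer(string $id, string $parentId): void
    {
        if (isset($this->questions[$parentId])) {
            $this->grouped[$parentId][] = $id;
        } else {
            $this->pending[$parentId][] = $id;  // question not seen yet
        }
    }
}
```

The traversal stays a simple single pass; all ordering knowledge lives in one place.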

Depending on whether that happens often, very often, rarely, or never, I would use a different strategy. For instance, I would never insert directly into database tables with foreign-key constraints enabled. If it happens often, I would wrap the entire import in one transaction, inside which the key constraints are disabled and re-enabled at the end.
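A sketch of that transaction strategy, shown with an in-memory SQLite database so it is self-contained (the table layout is invented); with MySQL you would use SET FOREIGN_KEY_CHECKS = 0/1 instead of the PRAGMA:

```php
<?php
// Sketch: one transaction for the whole import, with foreign-key checks
// disabled during the bulk load (hypothetical schema, SQLite for demo).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    post_type_id INTEGER NOT NULL,
    parent_id INTEGER REFERENCES posts(id)
)');

$pdo->exec('PRAGMA foreign_keys = OFF');   // relax constraints for the import
$pdo->beginTransaction();
try {
    $insert = $pdo->prepare(
        'INSERT INTO posts (id, post_type_id, parent_id) VALUES (?, ?, ?)'
    );
    // An answer inserted before its question: fine while checks are off.
    $insert->execute([15228716, 2, 1600647]);
    $insert->execute([1600647, 1, null]);
    $pdo->commit();
} catch (Throwable $e) {
    $pdo->rollBack();
    throw $e;
} finally {
    $pdo->exec('PRAGMA foreign_keys = ON'); // re-enable after the import
}
```

The single transaction also avoids the per-row commit overhead, which matters at this data volume.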

+5

If the way you handle this large file is not sequential but requires random access, then I think the only viable option is to load the data into an XML database.

+2

Using PHP's XMLReader is the right approach.

The reason lies in your own statement:

I have to "group" each question with its answers. Basically: find a question (PostTypeId = 1), find its answers using the ParentId attribute of the other rows, and save them in a DB.

As far as I understand it, you want to build a database of questions and their answers. There is therefore no reason to do the "grouping" at the XML level. Put all the necessary information into the database and do the grouping at the database level, with SQL.
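For illustration, the grouping at the database level can then be a single JOIN (again an in-memory SQLite database with an invented schema):

```php
<?php
// Sketch: grouping questions with their answers purely in SQL
// (in-memory SQLite, invented schema and sample rows).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE posts (id INTEGER, post_type_id INTEGER, parent_id INTEGER)');
$pdo->exec('INSERT INTO posts VALUES (1, 1, NULL), (2, 2, 1), (3, 2, 1)');

// Each question id next to the ids of its answers.
$rows = $pdo->query(
    'SELECT q.id AS question_id, a.id AS answer_id
       FROM posts q
       JOIN posts a ON a.parent_id = q.id
      WHERE q.post_type_id = 1
      ORDER BY q.id, a.id'
)->fetchAll(PDO::FETCH_ASSOC);
```

The XML pass then only has to dump rows into the table; the database does the relational work it is built for.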

What you should use is a streaming, event-driven parsing method. See, for example, "High-performance XML parsing in Python with lxml" (even though it is about Python, it is a good starting point). This should be possible with XMLReader as well.

+1

Source: https://habr.com/ru/post/1484024/
