Parsing a large XML file via FTP

I need to parse a large XML file (> 1 GB) that resides on an FTP server. I already have an FTP connection obtained with ftp_connect(). (I use this connection for other FTP-related activities.)

I know that XMLReader is preferable for large XML files, but it only accepts URIs, so I assume a stream wrapper is required. The only FTP function I know of that lets me fetch just a small part of the file at a time is ftp_nb_fget() in combination with ftp_nb_continue().

However, I do not know how I should put it all together to make sure that the minimum amount of memory is used.
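
For illustration, this is roughly the non-blocking retrieval I have in mind (host, credentials and remote path are placeholders); the parsing side is what I can't work out:

    <?php
    // ftp_nb_fget() writes into a local stream while ftp_nb_continue()
    // pumps the transfer a piece at a time.
    $ftp = ftp_connect('ftp.example.com');
    ftp_login($ftp, 'user', 'pass');
    ftp_pasv($ftp, true);

    $local = fopen('php://temp', 'w+b');   // received bytes land here

    $ret = ftp_nb_fget($ftp, $local, '/path/huge.xml', FTP_BINARY);
    while ($ret === FTP_MOREDATA) {
        // Between calls, whatever has arrived in $local so far could be
        // read back and handed to a streaming XML parser.
        $ret = ftp_nb_continue($ftp);
    }
    if ($ret !== FTP_FINISHED) {
        die('FTP download failed');
    }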

3 answers

It looks like you may need to build on top of the low-level bits of PHP's XML Parser extension.

In particular, you can use xml_parse() to process the XML one string fragment at a time, after registering callbacks with the various xml_set_* functions to handle elements, character data, namespaces, entities, and so on. Those callbacks are triggered whenever the parser determines it has enough data to do so, which should mean you can process the file as you read it in arbitrarily sized chunks from the FTP server.


Proof of concept using the interactive CLI and xml_set_default_handler(), which is called for anything that does not have a more specific handler:

    php > $p = xml_parser_create('utf-8');
    php > xml_set_default_handler($p, function() { print_r(func_get_args()); });
    php > xml_parse($p, '<a');
    php > xml_parse($p, '>');
    php > xml_parse($p, 'Foo<b>Bar</b>Baz');
    Array ( [0] => Resource id #3 [1] => <a> )
    Array ( [0] => Resource id #3 [1] => Foo )
    Array ( [0] => Resource id #3 [1] => <b> )
    Array ( [0] => Resource id #3 [1] => Bar )
    Array ( [0] => Resource id #3 [1] => </b> )
    php > xml_parse($p, '</a>');
    Array ( [0] => Resource id #3 [1] => Baz )
    Array ( [0] => Resource id #3 [1] => </a> )
    php >
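
A minimal sketch of how this could be wired to an FTP source, assuming the ftp:// stream wrapper is available (the URI, credentials and handler bodies below are placeholders):

    <?php
    // Read the remote file in small chunks and push each chunk into the parser.
    $parser = xml_parser_create('UTF-8');
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) { /* opening tag */ },
        function ($parser, $name) { /* closing tag */ }
    );
    xml_set_character_data_handler($parser, function ($parser, $data) { /* text */ });

    $fp = fopen('ftp://user:pass@ftp.example.com/path/huge.xml', 'rb');
    if ($fp === false) {
        die('could not open FTP stream');
    }

    while (!feof($fp)) {
        $chunk = fread($fp, 8192);   // only ever 8 KB of the file in memory
        if (!xml_parse($parser, $chunk, feof($fp))) {
            die(xml_error_string(xml_get_error_code($parser)));
        }
    }

    fclose($fp);
    xml_parser_free($parser);

Because only one chunk is held at a time, memory use stays roughly constant regardless of file size.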

Hmm, I never tried this with FTP, but a stream context can be set up with stream_context_create() and registered for libxml with libxml_set_streams_context().

Then just pass the FTP URI to XMLReader::open().

EDIT: Note that you can also use the stream context for other actions. If you upload files, you can probably use the same stream context in conjunction with file_put_contents(), so you don't necessarily need any of the ftp_* functions at all.
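
A minimal sketch of what that could look like, assuming the ftp:// wrapper is enabled (the URI, credentials, context option and the <item> element name are placeholders):

    <?php
    // Register a stream context for libxml, then point XMLReader at the FTP URI.
    $context = stream_context_create([
        'ftp' => ['resume_pos' => 0],   // example ftp:// wrapper option
    ]);
    libxml_set_streams_context($context);

    $reader = new XMLReader();
    if (!$reader->open('ftp://user:pass@ftp.example.com/path/huge.xml')) {
        die('could not open the FTP URI');
    }

    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
            // handle one element at a time, e.g. $node = $reader->expand();
        }
    }

    $reader->close();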


This will depend on the layout of your XML file. But if it is something like RSS, i.e. really just a long list of items all wrapped in a single root tag, then an approach I have used is to cut out the individual sections and parse each one as a separate DOMDocument:

    $buffer = '';
    while ($line = getLineFromFtp()) {
        $buffer .= $line;
        if (strpos($line, '</item>') !== false) {
            parseBuffer($buffer);
            $buffer = '';
        }
    }

This is pseudocode, but it is an easy way to process a specific kind of XML file without building your own XMLReader-based solution. Of course, you will also need to watch for the opening tags so that the buffer always contains a well-formed XML fragment.
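
A fleshed-out version of that sketch, under the assumption that the records are <item> elements and that the ftp:// stream wrapper is used to read the file line by line (URI and element name are placeholders):

    <?php
    $fp = fopen('ftp://user:pass@ftp.example.com/path/huge.xml', 'rb');
    if ($fp === false) {
        die('could not open FTP stream');
    }

    $buffer = '';
    $inItem = false;

    while (($line = fgets($fp)) !== false) {
        if (!$inItem && strpos($line, '<item') !== false) {
            $inItem = true;              // start buffering at the opening tag
        }
        if ($inItem) {
            $buffer .= $line;
        }
        if ($inItem && strpos($line, '</item>') !== false) {
            $doc = new DOMDocument();
            if ($doc->loadXML($buffer)) {
                // process one <item> here, e.g. via $doc->documentElement
            }
            $buffer = '';                // drop the chunk again to save memory
            $inItem = false;
        }
    }

    fclose($fp);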

Note that this will not work with all kinds of XML. But when it fits, it is a simple, clear way to do the job while keeping memory usage as low as possible.

