List all child nodes of the document element

I have a very large XML file and want to list all the child nodes of the document element. The code below works, but it takes a long time to process the file and it also loads data from the document that I do not need:

    use XML::Simple;

    my $xml = XML::Simple->new();
    my $d   = $xml->XMLin("sample.xml");
    my @arr = keys %$d;
    print "@arr\n";

XML example:

    <?xml version="1.0" encoding="ISO-8859-15"?>
    <document version="1.0" createdAt="2017-03-31T11:41:34">
      <TITLE>Computer Parts</TITLE>
      <PART001>
        <ITEM>Motherboard</ITEM>
        <MANUFACTURER>ASUS</MANUFACTURER>
        <MODEL>P3B-F</MODEL>
        <COST> 123.00</COST>
      </PART001>
      <PART002>
        <ITEM>Video Card</ITEM>
        <MANUFACTURER>ATI</MANUFACTURER>
        <MODEL>All-in-Wonder Pro</MODEL>
        <COST> 160.00</COST>
      </PART002>
      <PART003>
        <ITEM>Sound Card</ITEM>
        <MANUFACTURER>Creative Labs</MANUFACTURER>
        <MODEL>Sound Blaster Live</MODEL>
        <COST> 80.00</COST>
      </PART003>
      <PART004>
        <ITEM>14 inch Monitor</ITEM>
        <MANUFACTURER>LG Electronics</MANUFACTURER>
        <MODEL> 995E</MODEL>
        <COST> 290.00</COST>
      </PART004>
    </document>

Expected Result: TITLE, PART001, PART002, PART003, PART004

Can anyone suggest a faster and better way to get the desired result?

2 answers

Using XML::LibXML and XPath:

    use 5.014;
    use warnings;
    use XML::LibXML;

    my $file = 'xml';
    my $dom  = XML::LibXML->load_xml(location => $file);

    for my $child ($dom->findnodes( q{//document/*} )) {
        say $child->nodeName();
    }

Output

    TITLE
    PART001
    PART002
    PART003
    PART004

Or, if you only need the PART elements:

    for my $part ($dom->findnodes( q{//*[contains(name(),'PART')]} )) {
        say $part->nodeName();
    }

Output

    PART001
    PART002
    PART003
    PART004
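The `contains()` query above matches any element, at any depth, whose name contains PART. A narrower variant is sketched below, assuming the parts are always direct children of a root element named `document`, as in the sample. The helper name `part_names` and its `$prefix` parameter are made up for illustration. `starts-with()` is a standard XPath 1.0 function.

```perl
use 5.014;
use warnings;
use XML::LibXML;

# Hypothetical helper: names of the direct children of <document>
# whose element name starts with $prefix.
sub part_names {
    my ($dom, $prefix) = @_;

    # Assumption: $prefix contains no quote characters, since it is
    # interpolated straight into the XPath expression.
    return map { $_->nodeName }
        $dom->findnodes(qq{/document/*[starts-with(name(), '$prefix')]});
}

# Usage (assumes sample.xml from the question):
#   my $dom = XML::LibXML->load_xml(location => 'sample.xml');
#   say for part_names($dom, 'PART');
```

Anchoring the path at `/document` rather than `//document` means only the root's children are considered. This still builds the whole DOM in memory, so it changes what is matched, not how much is loaded.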

EDIT: Using pull parsing (doesn't load all xml into memory):

    use 5.014;
    use warnings;
    use XML::LibXML::Reader qw(XML_READER_TYPE_ELEMENT);

    my $file   = "xml";
    my $reader = XML::LibXML::Reader->new(location => $file)
        or die "problem $!";

    while ($reader->read()) {
        if ( $reader->depth == 1
            && $reader->nodeType == XML_READER_TYPE_ELEMENT )
        {
            say $reader->name;
        }
    }

Output

    TITLE
    PART001
    PART002
    PART003
    PART004

EDIT2

    use 5.014;
    use warnings;
    use XML::LibXML::Reader qw(XML_READER_TYPE_ELEMENT);

    my $file   = "xml";
    my $reader = XML::LibXML::Reader->new(location => $file)
        or die "problem $!";

    my $indoc;
    while ($reader->read()) {
        # set the flag while we are inside <document>
        # (1 on its start tag, 0 on its end tag)
        if ( $reader->name eq 'document' ) {
            $indoc = $reader->nodeType == XML_READER_TYPE_ELEMENT ? 1 : 0;
        }

        # print all level-1 elements that are inside <document>
        if ( $indoc
            && $reader->depth == 1
            && $reader->nodeType == XML_READER_TYPE_ELEMENT )
        {
            say $reader->name;
        }
    }
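A possible refinement of the reader loops above, offered as a sketch rather than as part of the original answer: once a level-1 node has been seen, the reader's `next` method moves past that node's whole subtree, so the Perl loop body runs once per child instead of once per node. The parser still has to scan the skipped content, so no particular speedup is claimed. The helper name `root_child_names` is hypothetical.

```perl
use 5.014;
use warnings;
use XML::LibXML::Reader qw(XML_READER_TYPE_ELEMENT);

# Hypothetical helper: names of the element children of the root element.
# Arguments are passed straight to XML::LibXML::Reader->new, e.g.
# (location => 'sample.xml') or (string => $xml).
sub root_child_names {
    my %source = @_;
    my $reader = XML::LibXML::Reader->new(%source)
        or die "cannot create reader\n";

    my @names;
    my $ok = $reader->read;
    while ($ok == 1) {
        if ($reader->depth == 1) {
            push @names, $reader->name
                if $reader->nodeType == XML_READER_TYPE_ELEMENT;
            $ok = $reader->next;    # skip this node's subtree
        }
        else {
            $ok = $reader->read;    # depth 0: step into or past the root
        }
    }
    return @names;
}

# Usage (assumes sample.xml from the question):
#   say for root_child_names(location => 'sample.xml');
```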

You can use XML::Twig, which, according to its documentation, is a Perl module for processing huge XML documents in tree mode.

Here is an example suitable for your use case:

    use feature qw(say);
    use XML::Twig;

    XML::Twig->new(
        twig_handlers => {
            'document/*' => sub {
                say $_->name;    # print out the element name
                $_->purge;       # remove the entire element from memory
            },
        },
    )->parsefile('sample.xml');

When used with your sample document, this prints:

    TITLE
    PART001
    PART002
    PART003
    PART004

Using a stream parser might be even faster.
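As one illustration of what a stream parser could look like here, the sketch below uses XML::Parser (an expat wrapper) with only a start-tag handler. It is an assumption about what "stream parser" might mean in this answer, not the author's code, and it has not been benchmarked against the other approaches. The helper name `stream_child_names` is hypothetical. Inside a start handler, `depth` counts the open ancestor elements, so children of the root report a depth of 1. Whether expat can decode the sample's ISO-8859-15 declaration depends on the encoding maps installed with XML::Parser, so that part would need checking.

```perl
use 5.014;
use warnings;
use XML::Parser;

# Hypothetical helper: names of the element children of the root element.
# $source is either a string holding the XML or an open filehandle.
sub stream_child_names {
    my ($source) = @_;
    my @names;

    my $parser = XML::Parser->new(
        Handlers => {
            # Called for every start tag; nothing is kept in memory
            # except the names we collect.
            Start => sub {
                my ($expat, $element) = @_;
                push @names, $element if $expat->depth == 1;
            },
        },
    );
    $parser->parse($source);
    return @names;
}

# Usage (assumes sample.xml from the question):
#   open my $fh, '<', 'sample.xml' or die "cannot open: $!";
#   say for stream_child_names($fh);
```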


Source: https://habr.com/ru/post/1266158/