What is the fastest way to output multiple element values from XML files in Perl?

Question

I have a bunch of XML files, each about 1-2 megabytes in size. Actually, more than a bunch: there are millions. They are all well-formed, and many have even been validated against their schema (confirmed with libxml2).

All were created by the same application, so they are in a consistent format (although this may theoretically change in the future).

I want to pull the values of a few elements from each file, from within a Perl script. Speed is important (I would like to take less than a second per file) and, as noted above, I already know that the files are well-formed.

I'm sorely tempted to just open the files in Perl and scan through until I see the element I'm looking for, grab the value (which is near the beginning of the file), and close the file.

On the other hand, I could use an XML parser (which could protect me from future XML formatting changes), but I suspect it will be slower than I would like.

Can you recommend a suitable approach and / or parser?

Thanks in advance.

Update

Here is the structure / complexity of the data I'm trying to pull:

<doc>
  ...
  <someparentnode attrib="notme" attrib2="5">
    <node>Not this one</node>
  </someparentnode>
  <someparentnode attrib="pickme" attrib2="5">
    <node>This is the data I want</node>
  </someparentnode>
  <someparentnode attrib="notme" 
     attrib2="reallyreallylonglineslikethisonearewrapped">
    <node>Not this one either and it may be 
      wrapped too.</node>
  </someparentnode>
  ...    
</doc>

The hierarchy goes several levels deeper than this, but I think it covers what I am trying to do.
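For illustration, the "open and scan" idea described above could be sketched roughly as follows. This is only a sketch under stated assumptions: the helper name first_pickme_value is made up for this example, the element and attribute names are taken from the sample above, and the regex assumes the wanted <node> sits inside the matching <someparentnode> with the attribute values in double quotes. It reads the file one </someparentnode>-terminated chunk at a time and stops at the first match, so it never reads the rest of the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text of the first <node> inside the first
# <someparentnode> whose attrib is "pickme", or undef if nothing matches.
sub first_pickme_value {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";

    # Read one chunk per closing </someparentnode> tag instead of per line,
    # so wrapped tags and wrapped text still land in a single chunk.
    local $/ = '</someparentnode>';

    while ( my $chunk = <$fh> ) {
        if ( $chunk =~ m{<someparentnode\b[^>]*\battrib="pickme"[^>]*>.*?<node>(.*?)</node>}s ) {
            close $fh;
            return $1;    # stop as soon as the value has been found
        }
    }
    close $fh;
    return;
}

for my $file (@ARGV) {
    my $value = first_pickme_value($file);
    print defined $value ? "$value\n" : "$file: not found\n";
}
```

Because this is plain pattern matching rather than parsing, it would break on things like single-quoted attributes, comments, or CDATA sections, which is the trade-off against a real XML parser mentioned above.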

+3

performance xml perl

Anon gordon Mar 14 '10 at 8:44

3 answers

, XML::Bare XML::Simple XML::Twig.

2-5 , : 0,2 4 , . : http://darkpan.com/files/xml-parsing-perl-gripes.txt.

0

mfontani 15 . '10 13:43

Awk

awk 'BEGIN{
 RS="</doc>"
 FS="</someparentnode>"
}

{
  for(i=1;i<=NF;i++){
     if( $i~/pickme/){
        m=split($i,a,"</node>")
        for(o=1;o<=m;o++){
          if(a[o]~/<node>/){
            gsub(/.*<node>/,"",a[o])
            print a[o]
          }
        }
     }
  }
}' file

Perl

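#!/usr/bin/perl
use strict;
use warnings;

# Treat the closing root tag as the record separator, so each read
# returns one whole document.
$/ = '</doc>';

while ( my $doc = <> ) {
    chomp $doc;    # strips the trailing '</doc>'

    # One chunk per <someparentnode> element.
    for my $chunk ( split m{</someparentnode>}, $doc ) {
        next unless $chunk =~ /pickme/;

        # Within a matching chunk, print the text of every <node>.
        for my $piece ( split m{</node>}, $chunk ) {
            next unless $piece =~ /<node>/;
            $piece =~ s/.*<node>//s;    # drop everything up to the opening tag
            print "$piece\n";
        }
    }
}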

$ perl script.pl file
This is the data I want

$ ./shell.sh
This is the data I want

-2

ghostdog74 Mar 14 '10 at 8:59

mirod · Accepted Answer · 2010-03-14T10:09:33+0000

Two XML-aware options are xml_grep (included with XML::Twig) and xml_grep2 (in App::xml_grep2).

You can run xml_grep -t '*[@attrib="pickme"]' *.xml or xml_grep2 -t '//*[@attrib="pickme"]' *.xml (the -t option outputs the result as text rather than XML).

If you need more control, you can use XML::Twig directly and call finish_now to stop parsing as soon as you have the value you need.

XML::LibXML is another option, whether through XPath, SAX, or the pull parser.
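As a rough illustration of the pull-parser route, a sketch using XML::LibXML::Reader might look like the following. It is not taken from the answer: the helper name pickme_text is hypothetical, and the element and attribute names are assumed from the sample in the question. The reader walks the file element by element, builds a small DOM fragment only for the matching element, and stops there.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: pull-parse a file and return the text of the first
# <someparentnode> carrying attrib="pickme", or undef if there is none.
sub pickme_text {
    my ($file) = @_;
    my $reader = XML::LibXML::Reader->new( location => $file )
        or die "Cannot read $file\n";

    # nextElement() skips ahead to the next element with the given name
    # and returns 1 while one is found.
    while ( $reader->nextElement('someparentnode') == 1 ) {
        my $attrib = $reader->getAttribute('attrib');
        next unless defined $attrib && $attrib eq 'pickme';

        # Copy just this element (with its children) into a DOM node,
        # take its text, and stop reading the rest of the file.
        my $text = $reader->copyCurrentNode(1)->textContent;
        $reader->close;
        $text =~ s/^\s+|\s+$//g;    # trim the indentation around the value
        return $text;
    }
    return;
}

for my $file (@ARGV) {
    my $text = pickme_text($file);
    print defined $text ? "$text\n" : "$file: not found\n";
}
```

Unlike a regex scan, this still goes through a real parser, so quoting style, entities, and CDATA sections are handled by libxml2 rather than by hand.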

Example using XML::Twig:

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

my $twig= XML::Twig->new( twig_handlers => { '*[@attrib="pickme"]' => \&pickme });

foreach my $file (@ARGV)
  { $twig->parsefile( $file); }

sub pickme
  { my( $twig, $node)= @_;
    print $node->text, "\n";
    $twig->finish_now;
  }
