What is the fastest way to pull the values of a few elements out of XML files in Perl?

I have a bunch of XML files, each about 1-2 megabytes in size. More than a bunch, in fact: there are millions. They are all well-formed, and many have even been validated against their schema (confirmed by libxml2).

All were created by the same application, so they are in a consistent format (although this may theoretically change in the future).

I want to check the value of one element in each file from a Perl script. Speed is important (I would like to spend less than a second per file), and, as noted above, I already know that the files are well-formed.

I'm sorely tempted to just "open" the files in Perl and scan until I see the element I'm looking for, grab the value (which is near the beginning of the file), and close the file.

On the other hand, I could use an XML parser (which could protect me from future XML formatting changes), but I suspect it will be slower than I would like.

Can you recommend a suitable approach and / or parser?

Thanks in advance.

Update

Here is the structure / complexity of the data I'm trying to pull:

<doc>
  ...
  <someparentnode attrib="notme" attrib2="5">
    <node>Not this one</node>
  </someparentnode>
  <someparentnode attrib="pickme" attrib2="5">
    <node>This is the data I want</node>
  </someparentnode>
  <someparentnode attrib="notme" 
     attrib2="reallyreallylonglineslikethisonearewrapped">
    <node>Not this one either and it may be 
      wrapped too.</node>
  </someparentnode>
  ...    
</doc>

The hierarchy goes several levels deeper than this, but I think it covers what I am trying to do.

+3
3 answers

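Two ready-made XML grep tools are xml_grep (included with XML::Twig) and xml_grep2 (from App::xml_grep2).

You would run either

xml_grep -t '*[@attrib="pickme"]' *.xml

or

xml_grep2 -t '//*[@attrib="pickme"]' *.xml

(the -t option prints the result as text rather than XML).

If you write your own code with XML::Twig, a handler can call finish_now to stop parsing as soon as the element has been found.

XML::LibXML is another option: it offers XPath, a SAX interface, and a pull parser.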
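For the pull-parser route, here is a minimal sketch using XML::LibXML::Reader. It is only an illustration under stated assumptions: the element and attribute names (someparentnode, attrib, node) come from the sample in the question, the helper name first_pickme is made up for this example, and it has not been timed against the real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: return the text of <node> inside the first
# <someparentnode attrib="pickme">, or undef if there is none.
# Arguments are passed straight to the reader constructor,
# e.g. (location => $file) or (string => $xml).
sub first_pickme {
    my (%source) = @_;
    my $reader = XML::LibXML::Reader->new(%source)
        or die "cannot create XML reader\n";

    # Jump from one <someparentnode> start tag to the next.
    while ( $reader->nextElement('someparentnode') ) {
        my $attrib = $reader->getAttribute('attrib');
        next unless defined $attrib && $attrib eq 'pickme';

        # Build a small DOM subtree for this one element only.
        my $parent = $reader->copyCurrentNode(1);
        my ($node) = $parent->getChildrenByTagName('node');

        # Stop here: the rest of the file is never parsed.
        $reader->close;
        return defined $node ? $node->textContent : undef;
    }
    return;
}

for my $file (@ARGV) {
    my $value = first_pickme( location => $file );
    print "$file: ", ( $value // '(not found)' ), "\n";
}
```

Because the reader streams through the file and is closed as soon as the match is found, only the start of each document should need to be parsed when the wanted element is near the top. Note that getChildrenByTagName only looks at direct children; if <node> sits deeper in the real files, findnodes with a suitable XPath on the copied subtree would be needed instead.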

An example with XML::Twig:

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

my $twig= XML::Twig->new( twig_handlers => { '*[@attrib="pickme"]' => \&pickme });

foreach my $file (@ARGV)
  { $twig->parsefile( $file); }

sub pickme
  { my( $twig, $node)= @_;
    print $node->text, "\n";
    $twig->finish_now;
  }
+8

You could also look at XML::Bare instead of XML::Simple or XML::Twig.

See: http://darkpan.com/files/xml-parsing-perl-gripes.txt

0

Awk

awk 'BEGIN{
 RS="</doc>"
 FS="</someparentnode>"
}

{
  for(i=1;i<=NF;i++){
     if( $i~/pickme/){
        m=split($i,a,"</node>")
        for(o=1;o<=m;o++){
          if(a[o]~/<node>/){
            gsub(/.*<node>/,"",a[o])
            print a[o]
          }
        }
     }
  }
}' file

Perl

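#!/usr/bin/perl
use strict;
use warnings;

# Read one whole document per record, then split it on the closing parent tag.
local $/ = '</doc>';
while ( my $doc = <> ) {
    chomp $doc;
    for my $block ( split m{</someparentnode>}, $doc ) {
        next unless $block =~ /pickme/;    # keep only the block we want
        for my $piece ( split m{</node>}, $block ) {
            next unless $piece =~ /<node>/;
            $piece =~ s/.*<node>//s;       # drop everything up to the opening tag
            print "$piece\n";
        }
    }
}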

$ perl script.pl file
This is the data I want

$ ./shell.sh
This is the data I want
-2

Source: https://habr.com/ru/post/1736866/

