Perl: how to handle a stream of XML objects without a root node

I need to parse a huge file with Perl, so I will use a streaming parser. The file contains several XML documents (objects) one after another, but no enclosing root node. As you would expect, this makes the XML parser abort after the first object. The fix is probably to prepend and append a fake root node:

<FAKE_ROOT_TAG>Original Stream</FAKE_ROOT_TAG>

Since the file is huge (> 1 GB), I don't want to copy or rewrite it. I would prefer a class or module that is transparent to the XML parser and concatenates several streams into one:

stream1 : <FAKE_ROOT_TAG>                 \
stream2 : Original Stream from file        >   merged stream
stream3 : </FAKE_ROOT_TAG>                / 

Can you point me to such a module or sample code for this problem?

2 answers

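Here is a trick taken from PerlMonks: declare the real file as an external entity and reference it inside a wrapper document that supplies the root element.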

#!/usr/bin/perl

use strict;
use warnings;

use XML::Parser;
use XML::LibXML;

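# Path of the root-less XML file; it is pulled in as an external entity below.
my $doc_file = shift @ARGV or die "usage: $0 file.xml\n";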

my $xml=qq{
     <!DOCTYPE doc 
           [<!ENTITY real_doc SYSTEM "$doc_file">]
     >
     <doc>
         &real_doc;
     </doc>
};

{ print "XML::Parser:\n";
  my $t= XML::Parser->new( Style => 'Stream')->parse( $xml);
}

{ print "XML::LibXML:\n";
  my $parser = XML::LibXML->new();
  my $doc = $parser->parse_string($xml);
  print $doc->toString;
}
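
XML::LibXML's parse_string builds the whole tree in memory, so for a file over 1 GB a handler-based pass with XML::Parser is probably the better fit. The following is only a sketch of that idea under a few assumptions: entity_wrapper and count_objects are hypothetical helper names, the root element name is arbitrary, and the file is assumed to contain no XML declarations or DOCTYPEs of its own between the objects.

```perl
use strict;
use warnings;

# Hypothetical helper: build the small wrapper document that pulls the
# real file in as an external entity under a made-up root element.
sub entity_wrapper {
    my ($file, $root) = @_;
    $root ||= 'doc';
    die "file name must not contain a double quote\n" if $file =~ /"/;
    return qq{<!DOCTYPE $root [<!ENTITY real_doc SYSTEM "$file">]>\n}
         . qq{<$root>&real_doc;</$root>\n};
}

# Hypothetical helper: count the top-level objects without building a tree.
# Inside a Start handler, depth() is the number of already-open ancestors,
# so 1 means "direct child of the fake root".
sub count_objects {
    my ($file) = @_;
    require XML::Parser;    # loaded lazily; only needed when actually parsing
    my $count  = 0;
    my $parser = XML::Parser->new(
        NoLWP    => 1,      # resolve the entity as a local file, not via LWP
        Handlers => {
            Start => sub {
                my ($expat, $element) = @_;
                $count++ if $expat->depth == 1;
            },
        },
    );
    $parser->parse( entity_wrapper($file) );
    return $count;
}
```

Calling count_objects("huge.xml") would then report how many objects sit directly under the fake root; real processing would go into the Start, End and Char handlers instead of the counter.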

Here is a simple example of how you can do this by passing a fake filehandle to your XML parser. The object overloads the readline operator (<>) so that it returns the fake root tags with the lines of the file between them.

package FakeFile;

use strict;
use warnings;

use overload '<>' => \&my_readline;

sub new {
    my $class = shift;
    my $filename  = shift;

    open my $fh, '<', $filename or die "open $filename: $!";

    return bless { fh => $fh }, $class;
}

sub my_readline {
    my $self = shift;
    return if $self->{done};

    if ( not $self->{started} ) {
        $self->{started} = 1;
        return '<fake_root_tag>';
    }

    if ( eof $self->{fh} ) {
        $self->{done} = 1;
        return '</fake_root_tag>';
    }

    return readline $self->{fh};
}


1;

This will not work if your parser expects a genuine filehandle (for example, because it reads with something like sysread), but you might find it inspiring.

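Usage example (with the package saved as FakeFile.pm in the current directory; -I. is needed because Perl 5.26 and later no longer search the current directory for modules):

echo "one
two
three" > myfile
perl -I. -MFakeFile -E 'my $f = FakeFile->new( "myfile" ); print while <$f>'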
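
If the parser does insist on a real filehandle, as cautioned above, one alternative is to skip the handle altogether and push the data into the parser piece by piece. The sketch below assumes XML::Parser and its incremental interface (parse_start returns an XML::Parser::ExpatNB object, which accepts data through parse_more and is finished with parse_done). The helper names wrapped_chunks and parse_wrapped, the default root name and the chunk size are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return an iterator that yields the opening tag,
# then chunks of at most $chunk_size bytes read from $fh, then the closing
# tag, and finally undef once everything has been handed out.
sub wrapped_chunks {
    my ($fh, $root, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;
    my $state = 'open';
    return sub {
        if ($state eq 'open') {
            $state = 'body';
            return "<$root>";
        }
        if ($state eq 'body') {
            my $n = read $fh, my $buf, $chunk_size;
            die "read failed: $!" unless defined $n;
            return $buf if $n;          # still data left in the file
            $state = 'done';
            return "</$root>";          # end of file: close the fake root
        }
        return;                         # exhausted
    };
}

# Hypothetical helper: feed the wrapped stream to XML::Parser incrementally,
# so the file is neither copied nor loaded into memory as a whole.
sub parse_wrapped {
    my ($fh, $handlers, $root) = @_;
    require XML::Parser;                # loaded lazily; only needed here
    my $nb   = XML::Parser->new( Handlers => $handlers )->parse_start;
    my $next = wrapped_chunks( $fh, $root || 'fake_root_tag' );
    while ( defined( my $chunk = $next->() ) ) {
        $nb->parse_more($chunk);
    }
    return $nb->parse_done;
}
```

A call might look like parse_wrapped($fh, { Start => sub { ... } }) with $fh opened on the big file. As with any fake-root approach, this assumes the individual objects do not carry their own XML declarations.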

Source: https://habr.com/ru/post/1532169/

