Perl, XML::Twig: how to read a field with the same tag

I am processing an XML file that I receive from a partner. I have no control over the structure of this XML file. An extract:

    <?xml version="1.0" encoding="UTF-8"?>
    <objects>
      <object>
        <id>VW-XJC9</id>
        <name>Name</name>
        <type>House</type>
        <description>
          <![CDATA[<p>some descrioption of the house</p>]]>
        </description>
        <localcosts>
          <localcost>
            <type>mandatory</type>
            <name>What kind of cost</name>
            <description>
              <![CDATA[Some text again, different than the first tag]]>
            </description>
          </localcost>
        </localcosts>
      </object>
    </objects>

I use Twig because this XML file is about 11 GB (about 100,000 different objects). The problem is that when I get to the localcosts part, its three fields (type, name and description) are skipped, possibly because these names have already been used earlier in the document.

The code that I use to go through the xml file is as follows:

    my $twig= new XML::Twig( twig_handlers => {
        id          => \&get_ID,
        name        => \&get_Name,
        type        => \&get_Type,
        description => \&get_Description,
        localcosts  => \&get_Localcosts
    });

    $lokaal="c:\\temp\\data3.xml";
    getstore($xml, $lokaal);
    $twig->parsefile("$lokaal");

    sub get_ID {
        my( $twig, $data)= @_;
        $field[0]=$data->text;
        $twig->purge;
    }

    sub get_Name {
        my( $twig, $data)= @_;
        $field[1]=$data->text;
        $twig->purge;
    }

    sub get_Type {
        my( $twig, $data)= @_;
        $field[3]=$data->text;
        $twig->purge;
    }

    sub get_Description {
        my( $twig, $data)= @_;
        $field[8]=$data->text;
        $twig->purge;
    }

    sub get_Localcosts{
        my ($t, $item) = @_;
        my @localcosts = $item->children;
        for my $localcost ( @localcosts ) {
            print "$field[0]: $localcost->text\n";
            my @costs = $localcost->children;
            for my $cost (@costs) {
                $Type       =$cost->text if $cost->name eq q{type};
                $Name       =$cost->text if $cost->name eq q{name};
                $Description=$cost->text if $cost->name eq q{description};
                print "Fields: $Type, $Name, $Description\n";
            }
        }
        $t->purge;
    }

When I run this code, the main fields are read without problems, but when it reaches the localcosts part, the inner for loop fails. When I change the field names in the XML to unique ones, the code works fine.

Can someone help me?

thanks

3 answers

If you want the handlers for type, name, and description to fire only for elements directly inside an object element, specify the path:

    my $twig = new XML::Twig( twig_handlers => {
        id                   => \&get_ID,
        'object/name'        => \&get_Name,
        'object/type'        => \&get_Type,
        'object/description' => \&get_Description,
        localcosts           => \&get_Localcosts
    });
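To see the effect of path-qualified handlers in isolation, here is a minimal sketch that parses a small inline document instead of the real file. The sample XML, the %object hash and the @costs array are made up for illustration; only the handler keys follow the answer above.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample that mirrors the structure in the question.
my $xml = <<'XML';
<objects>
  <object>
    <id>VW-XJC9</id>
    <name>Name</name>
    <type>House</type>
    <localcosts>
      <localcost>
        <type>mandatory</type>
        <name>What kind of cost</name>
      </localcost>
    </localcosts>
  </object>
</objects>
XML

my (%object, @costs);

my $twig = XML::Twig->new(
    twig_handlers => {
        # Fire only for <name> and <type> whose parent is <object>.
        'object/name' => sub { $object{name} = $_[1]->text },
        'object/type' => sub { $object{type} = $_[1]->text },
        # Fire only for <type> whose parent is <localcost>.
        'localcost/type' => sub { push @costs, $_[1]->text },
    },
);
$twig->parse($xml);

print "$object{name} / $object{type} / @costs\n";
```

Because each handler is tied to a parent element, the values from localcost no longer overwrite the values from object.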

The problem is that the id, name, type and description handlers are executed for every element with that tag name, so name, type and description fire for both occurrences. You will find that the contents of @field match the localcost values, because the data from the object values has been overwritten.

In addition, the handlers called while processing the localcost elements call $twig->purge, which removes the data from memory. Therefore, when the localcosts handler is called, it finds the element empty.

I think the easiest way to do this is to write a single handler that processes one object node at a time and then purges it.

The program at the end of this answer demonstrates the idea. Note that I used Data::Dumper only so that you can see the contents of @fields after it has been filled.

It is very important that you use strict and use warnings at the top of every Perl program, especially if you are asking for help. This is a simple measure that can reveal many simple errors that you might otherwise spend a lot of time searching for.

Note also that the "indirect object" form of method call is discouraged: you should write XML::Twig->new(...) instead of new XML::Twig(...).

And if you use single quotes instead of double quotes, the backslashes inside the string need not be doubled, unless a backslash is the last character of the string. But Perl is perfectly happy with forward slashes as the path separator, even on Windows.
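As a side illustration of the quoting remark only (separate from the main program of this answer), the following sketch compares the spellings of the path. The variable names are made up for the example.

```perl
use strict;
use warnings;

# Double quotes: every backslash has to be escaped.
my $double = "c:\\temp\\data3.xml";

# Single quotes: a backslash in the middle of the string is taken literally.
my $single = 'c:\temp\data3.xml';

# A backslash that ends a single-quoted string still has to be doubled,
# otherwise it would escape the closing quote.
my $dir = 'c:\temp\\';

# Forward slashes avoid the question entirely.
my $forward = 'c:/temp/data3.xml';

print "identical\n" if $double eq $single;
print "$dir\n";
```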

I hope this helps

    use strict;
    use warnings;

    use XML::Twig;
    use Data::Dumper;
    $Data::Dumper::Useqq = 1;

    my $twig= XML::Twig->new( twig_handlers => { object => \&get_Object });

    my $lokaal = 'c:\temp\data3.xml';
    my @fields;

    $twig->parsefile($lokaal);

    sub get_Object {
        my ($twig, $object) = @_;

        $fields[0] = $object->findvalue('id');
        $fields[1] = $object->findvalue('name');
        $fields[3] = $object->findvalue('type');
        $fields[8] = $object->findvalue('description');
        print Dumper \@fields;

        my @localcosts = $object->findnodes('localcosts/localcost');
        for my $localcost (@localcosts) {
            my $type        = $localcost->findvalue('type');
            my $name        = $localcost->findvalue('name');
            my $description = $localcost->findvalue('description');
            print "$type, $name, $description\n";
        }

        $twig->purge;
    }

Output

    $VAR1 = [
              "VW-XJC9",
              "Name",
              undef,
              "House",
              undef,
              undef,
              undef,
              undef,
              "<p>some descrioption of the house</p> "
            ];
    mandatory, What kind of cost, Some text again, different than the first tag

As Borodin said, if you have handlers on name, type and description, and you call $twig->purge at the end of each handler, then those elements have already been removed from the tree by the time the localcosts handler runs. Instead, you can set a handler on object that only calls $twig->purge, drop the purge calls from the other handlers, and you'll be fine.

You don't need to call purge very often; just make sure you call it at a level low enough in the tree that you don't use too much memory. There is no point in calling it for every leaf element.

This is a common mistake, and one I make quite often myself :-(
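A minimal sketch of that arrangement, assuming a small inline document in place of the real file: the localcost handler only collects data, and purge is called once per object. The sample XML and the @seen and @rows variables are hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample with two objects.
my $xml = <<'XML';
<objects>
  <object>
    <id>A1</id>
    <localcosts>
      <localcost><type>mandatory</type></localcost>
      <localcost><type>optional</type></localcost>
    </localcosts>
  </object>
  <object>
    <id>B2</id>
    <localcosts>
      <localcost><type>optional</type></localcost>
    </localcosts>
  </object>
</objects>
XML

my @seen;   # localcost types collected for the object being parsed
my @rows;   # one summary line per object

my $twig = XML::Twig->new(
    twig_handlers => {
        # Collect data but do not purge: the parent object is still needed.
        localcost => sub {
            my ($t, $cost) = @_;
            push @seen, $cost->first_child_text('type');
        },
        # The object is complete here, so it is safe to free its memory.
        object => sub {
            my ($t, $object) = @_;
            push @rows, $object->first_child_text('id') . ': ' . join(', ', @seen);
            @seen = ();
            $t->purge;
        },
    },
);
$twig->parse($xml);

print "$_\n" for @rows;
```

Memory use stays bounded by the size of one object, which is the property that matters for a file of this size.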


Source: https://habr.com/ru/post/970539/

