Why is XML::Simple discouraged?

From the XML::Simple documentation:

The use of this module in new code is discouraged. Other modules are available which provide more straightforward and consistent interfaces. In particular, XML::LibXML is highly recommended.

The major problems with this module are the large number of options and the arbitrary ways in which these options interact, often with unexpected results.

Can someone clarify for me what the main reasons for this are?

+55
xml perl xml-simple
Oct 21 '15 at 19:36
3 answers

The real problem is that the first thing XML::Simple tries to do is take XML and represent it as a Perl data structure.

As you no doubt know from perldata, the two main data structures available are the hash and the array.

  • Arrays are ordered scalars.
  • Hashes are unordered key-value pairs.

And XML does not really fit either. It has elements that:

  • are not uniquely named (which means hashes do not fit),
  • ... but are ordered within the file,
  • may have attributes (which you could insert into a hash),
  • may have content (but might not, as with an empty tag),
  • may have children (to any depth).

These things do not map directly onto the available Perl data structures. At a simplistic level, a nested hash of hashes might fit, but it cannot cope with elements that have duplicate names. Nor can you easily tell attributes and child nodes apart.
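To make the mismatch concrete, here is a minimal sketch in plain Perl, with no XML module involved. The element names and values are invented for illustration. It shows what happens when an ordered list of children with a repeated name is stored in a hash keyed by element name:

```perl
use strict;
use warnings;

# Hypothetical child elements, in document order: [ name, text ].
my @children = (
    [ item => 'first'  ],
    [ note => 'second' ],
    [ item => 'third'  ],
);

# Naive mapping: the element name becomes the hash key.
my %by_name;
$by_name{ $_->[0] } = $_->[1] for @children;

# The second 'item' silently overwrote the first, and the position of
# 'note' relative to the 'item' elements is no longer recorded.
printf "%d elements in, %d keys out\n", scalar(@children), scalar( keys %by_name );
print "item => $by_name{item}\n";
```

Turning repeated names into arrays avoids the overwrite, but the relative order of differently named siblings is still lost.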

So XML::Simple tries to guess based on the XML content, takes "hints" from the various option settings, and then, when you try to output the content, it (tries to) apply the same process in reverse.

As a result, for anything but the simplest XML, it becomes unwieldy at best and loses data at worst.

Consider:

    <xml>
      <parent>
        <child att="some_att">content</child>
      </parent>
      <another_node>
        <another_child some_att="a value" />
        <another_child different_att="different_value">more content</another_child>
      </another_node>
    </xml>

This, when parsed with XML::Simple, gives you:

    $VAR1 = {
      'parent' => {
        'child' => {
          'att' => 'some_att',
          'content' => 'content'
        }
      },
      'another_node' => {
        'another_child' => [
          {
            'some_att' => 'a value'
          },
          {
            'different_att' => 'different_value',
            'content' => 'more content'
          }
        ]
      }
    };

Note that under parent you now have just anonymous hashes, but under another_node you have an array of anonymous hashes.

So, to access the content of child:

    my $child = $xml->{parent}->{child}->{content};

Note that you have a "child" node with a "content" node beneath it, which is there not because the XML has such an element but because it is ... content.

But to access the content under the first another_child element:

    my $another_child = $xml->{another_node}->{another_child}->[0]->{content};

Note that, because there are several <another_child> elements, the XML has been parsed into an array, where it was not with a single one. (If you did have an element called content beneath it, you would end up with something else again.) You can change this with ForceArray, but then you end up with a hash of arrays of hashes of arrays of hashes of arrays, although it is at least consistent in its handling of child elements. Edit: note that, following discussion, this is a bad default rather than a flaw in XML::Simple.

You should set:

    ForceArray => 1, KeyAttr => [], ForceContent => 1

If you apply this to the XML above, you get instead:

    $VAR1 = {
      'another_node' => [
        {
          'another_child' => [
            {
              'some_att' => 'a value'
            },
            {
              'different_att' => 'different_value',
              'content' => 'more content'
            }
          ]
        }
      ],
      'parent' => [
        {
          'child' => [
            {
              'att' => 'some_att',
              'content' => 'content'
            }
          ]
        }
      ]
    };

This gives you consistency, because single-node elements are no longer handled differently from multi-node ones.

But you still:

  • have a tree five references deep to get at a value.

For example:

    print $xml->{parent}->[0]->{child}->[0]->{content};

You still have content and child hash elements treated as if they were attributes and, because hashes are unordered, you simply cannot reconstruct the input. So basically, you have to parse it and then run it through Dumper to find out where you need to look.

But with an XPath query, you get at that node with:

    findnodes("/xml/parent/child");
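For context, a minimal sketch of that query as a complete script, assuming XML::LibXML is installed and using the sample document from above. The variable names are chosen only for illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml_text = <<'XML';
<xml>
  <parent>
    <child att="some_att">content</child>
  </parent>
  <another_node>
    <another_child some_att="a value" />
    <another_child different_att="different_value">more content</another_child>
  </another_node>
</xml>
XML

my $doc = XML::LibXML->load_xml( string => $xml_text );

# The path names the node directly, whatever else is in the document.
my ($child)    = $doc->findnodes('/xml/parent/child');
my $child_text = $child->textContent;
my $child_att  = $child->getAttribute('att');

# Attributes can be part of the query as well.
my ($other)    = $doc->findnodes('//another_child[@different_att]');
my $other_text = $other->textContent;

print "$child_text ($child_att)\n";
print "$other_text\n";
```

Text and attributes are reached through separate methods here, so the content/attribute ambiguity described above does not come up.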

What you don't get in XML::Simple that you do get in XML::Twig (and, I presume, in XML::LibXML, though I know it less well):

  • XPath support. XPath is the XML way of expressing a path to a node. So you can "find" a node in the example above with get_xpath('//child'). You can even use attributes in the XPath, as in get_xpath('//another_child[@different_att]'), which selects exactly the one you want. (You can iterate over the matches too.)
  • cut and paste, to move elements around.
  • parsefile_inplace, so you can modify XML with an in-place edit.
  • pretty_print options, to format XML.
  • twig_handlers and purge, which let you process really big XML without having to load it all into memory.
  • simplify, if you really must stay backward compatible with XML::Simple.
  • The code is generally much simpler than trying to follow chains of references to hashes and arrays, which can never be done consistently because of the fundamental differences in structure.
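As a rough illustration of the twig_handlers and purge item, here is a minimal sketch, assuming XML::Twig is installed. It parses a string rather than a large file to stay self-contained, and the @seen array is just an illustrative place to collect results:

```perl
use strict;
use warnings;
use XML::Twig;

my $xml_text = <<'XML';
<xml>
  <parent>
    <child att="some_att">content</child>
  </parent>
  <another_node>
    <another_child some_att="a value" />
    <another_child different_att="different_value">more content</another_child>
  </another_node>
</xml>
XML

my @seen;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called for each <another_child> as soon as it is fully parsed.
        another_child => sub {
            my ( $t, $elt ) = @_;
            push @seen, $elt->text;
            $t->purge;    # free the part of the tree parsed so far
        },
    },
);
$twig->parse($xml_text);

print "[$_]\n" for @seen;
```

With a real multi-gigabyte file you would call parsefile instead of parse. The handler body stays the same, and memory use stays bounded because each element is discarded after it is handled.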

It is also widely available: it is easy to download from CPAN and is distributed as an installable package on many operating systems. (Sadly, it is not part of a default install.)

See: XML::Twig quick reference

For comparison:

    use XML::Simple;
    use Data::Dumper;

    my $xml = XMLin( \*DATA, ForceArray => 1, KeyAttr => [], ForceContent => 1 );
    print Dumper $xml;
    print $xml->{parent}->[0]->{child}->[0]->{content};

Vs.

    use XML::Twig;

    my $twig = XML::Twig->parse( \*DATA );
    print $twig->get_xpath( '/xml/parent/child', 0 )->text;
    print $twig->root->first_child('parent')->first_child_text('child');
+54
Oct 21 '15 at 19:36

XML::Simple is the most complex XML parser available

The main problem with XML::Simple is that the resulting structure is extremely hard to navigate correctly. $ele->{ele_name} can return any of the following (even for elements that follow the same spec):

    [ { att => 'val', ..., content => 'content' }, ... ]
    [ { att => 'val', ..., }, ... ]
    [ 'content', ... ]
    { 'id' => { att => 'val', ..., content => 'content' }, ... }
    { 'id' => { att => 'val', ... }, ... }
    { 'id' => { content => 'content' }, ... }
    { att => 'val', ..., content => 'content' }
    { att => 'val', ..., }
    'content'

This means that you have to perform all kinds of checks to see what you actually got. But the sheer complexity of this encourages developers to make very bad assumptions instead. That lets all kinds of problems slip into production, causing live code to fail when corner cases are encountered.
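To give a feel for the checks involved, here is a minimal sketch in plain Perl. The helper contents_of is hypothetical, not part of XML::Simple, and it only covers the scalar, hash and array shapes from the list above:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Simple): return the text content
# found in whatever shape $ele->{ele_name} turned out to have.
sub contents_of {
    my ($value) = @_;
    return () unless defined $value;

    # One element arrives as a bare value, several as an array reference.
    my @items = ref($value) eq 'ARRAY' ? @$value : ($value);

    my @contents;
    for my $item (@items) {
        if ( ref($item) eq 'HASH' ) {
            # Element with attributes: the text, if any, is under 'content'.
            push @contents, $item->{content} if exists $item->{content};
        }
        else {
            # Element without attributes: the value is the text itself.
            push @contents, $item;
        }
    }
    return @contents;
}

# The id-keyed shapes ({ 'id' => { ... } }) are deliberately not handled:
# structurally they look the same as a single element with attributes.
print join( ', ', contents_of('content') ), "\n";
print join( ', ', contents_of( { att => 'val', content => 'content' } ) ), "\n";
print join( ', ', contents_of( [ 'a', { content => 'b' }, { att => 'val' } ] ) ), "\n";
```

Even this much code cannot tell an id-keyed hash from a single element's attributes without knowing the document format in advance, which is the fragility the answer describes.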

Options for creating a more regular tree fall short

You can use the following options to create a more regular tree:

    ForceArray => 1, KeyAttr => [], ForceContent => 1

But even with these options, many checks are still needed to extract information from the tree. For example, getting the /root/eles/ele nodes from a document is a common operation that should be trivial to perform, but with XML::Simple it requires the following:

    # Requires: ForceArray => 1, KeyAttr => [], ForceContent => 1, KeepRoot => 0
    # Assumes the format doesn't allow for more than one /root/eles.
    # The format wouldn't be supported if it allowed /root to have an attr named eles.
    # The format wouldn't be supported if it allowed /root/eles to have an attr named ele.
    my @eles;
    if ($doc->{eles} && $doc->{eles}[0]{ele}) {
        @eles = @{ $doc->{eles}[0]{ele} };
    }

In another parser, you would use the following:

    my @eles = $doc->findnodes('/root/eles/ele');
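As a minimal sketch of why no extra checks are needed, assuming XML::LibXML is installed (the two tiny documents are made up to follow the /root/eles/ele layout):

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical documents: one with the optional parts, one without.
my $full = XML::LibXML->load_xml(
    string => '<root><eles><ele>a</ele><ele>b</ele></eles></root>' );
my $empty = XML::LibXML->load_xml( string => '<root/>' );

# The same expression works whether or not /root/eles exists.
my @found   = map { $_->textContent } $full->findnodes('/root/eles/ele');
my @missing = $empty->findnodes('/root/eles/ele');

print "found: @found\n";
print "missing: ", scalar(@missing), "\n";
```

When the path matches nothing, findnodes returns an empty list in list context, so the caller can loop over the result without guarding each level.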

XML::Simple imposes numerous limitations and lacks common features

  • It is completely useless for producing XML. Even with ForceArray => 1, ForceContent => 1, KeyAttr => [], KeepRoot => 1, there are far too many details that cannot be controlled.

  • It does not preserve the relative order of children with different names.

  • It has limited (with the XML::SAX backend) or no (with the XML::Parser backend) support for namespaces and namespace prefixes.

  • It cannot handle elements that have both text and elements as children (which means it cannot handle XHTML, among others).

  • Some backends (e.g. XML::Parser) cannot handle encodings that are not based on ASCII (e.g. UTF-16le).

  • An element cannot have a child element and an attribute with the same name.

  • It cannot create XML documents with comments.

Ignoring the major issues mentioned earlier, XML::Simple could still be usable within these limitations. But why go to the trouble of checking whether XML::Simple can handle your document format, and risk having to switch to another parser later? You could simply use a better parser for all of your documents from the start.

Not only do some other parsers spare you these limitations, they also provide plenty of other useful features. Below are a few features they may have that XML::Simple does not:

  • Speed. XML::Simple is extremely slow, especially if you use a backend other than XML::Parser. I am talking orders of magnitude slower than other parsers.

  • XPath selectors or similar.

  • Support for extremely large documents.

  • Support for pretty-printing.

Is XML::Simple ever useful?

The only format for which XML::Simple is the simplest option is one where no element is optional. I have had experience with countless XML formats, and I have never come across such a format.

This fragility and complexity alone are reason enough to stay away from XML::Simple, but there are others.

Alternatives

I use XML::LibXML. It is an extremely fast, full-featured parser. If I ever needed to handle documents that did not fit into memory, I would use XML::LibXML::Reader (and its copyCurrentNode(1)) or XML::Twig (using twig_roots).

+32
Oct 22 '15 at 4:37

I do not agree with the docs

I'll dissent and say that XML::Simple is just that: simple. It has always been easy and pleasant for me to use. Test it with the input you are receiving. As long as the input does not change, you are fine. The same people who complain about using XML::Simple complain about using JSON::Syck to serialize Moose. The docs are wrong because they put correctness ahead of efficiency. If you only care about the following, you are fine:

  • not throwing away data
  • building to a supplied format, not to an abstract schema

If you are writing an abstract parser that is defined not by the application but by a spec, I would use something else. I once worked at a company where we had to accept 300 different XML schemas, none of which had a spec. XML::Simple did the job easily. The other options would have required us to actually hire someone to get the job done. Everyone thinks XML is something that is sent in a rigid, all-encompassing, specified format, so that if you write one parser you are fine. If that is the case, don't use XML::Simple. Before JSON, XML was just a "dump this and walk" format for getting data from one language to another. People really used things like XML::Dumper. No one actually knew what was output. For dealing with that scenario, XML::Simple is great! Sane people still dump to JSON without a spec to accomplish the same thing. That is just how the world works.

Want to read the data in and not worry about the format? Want to traverse Perl structures rather than XML possibilities? Go with XML::Simple.

By extension...

Likewise, for most applications JSON::Syck is enough to "dump this and walk". Though if you are sending to lots of people, I would strongly suggest being considerate and writing a spec that you export to. But, you know what? Some day you will get a call from someone you don't want to talk to, who wants data that you don't normally export. And you are going to pipe it through JSON::Syck's voodoo and let them worry about it. If they want XML? Charge them $500 more and fire up good old XML::Dumper.

Takeaway

It may be less than perfect, but XML::Simple is efficient. Every hour saved in this arena is an hour you can spend in a more useful one. That is a real-world consideration.

The other answers

Look, XPath has some upsides. Every answer here comes down to preferring XPath over Perl. That's fine. If you would rather use a standardized XML domain-specific language to access your XML, go for it!

Perl does not provide an easy mechanism for accessing deeply nested optional structures.

    my $xml = [ { foo => 1 } ];  ## Always w/ ForceArray.
    my $xml = { foo => 1 };

Getting the value of foo in these two contexts can be tricky. XML::Simple knows this, and that is why you can force the former. However, even with ForceArray, if the element is not there you will throw an error.

    my $xml = { bar => [ { foo => 1 } ] };

Now, if bar is optional, you are left accessing it as $xml->{bar}[0]{foo}, and @{$xml->{bar}}[0] will throw an error. Anyway, that's just Perl. This has nothing to do with XML::Simple, in my opinion. And I have admitted that XML::Simple is not good for building to a spec. Show me the data, and I can access it with XML::Simple.
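For what it's worth, one common plain-Perl way to read such an optional branch is to fall back to an empty array reference before dereferencing. A minimal sketch follows; the helper name first_foo is made up for illustration:

```perl
use strict;
use warnings;

my $with_bar    = { bar => [ { foo => 1 } ] };
my $without_bar = {};    # the optional <bar> was absent

# Hypothetical helper: read foo from the first bar, if there is one.
sub first_foo {
    my ($xml) = @_;
    my $bars = $xml->{bar} || [];    # fall back to an empty list
    return @$bars ? $bars->[0]{foo} : undef;
}

for my $xml ( $with_bar, $without_bar ) {
    my $foo = first_foo($xml);
    print defined($foo) ? "foo = $foo\n" : "no bar element\n";
}
```

The fallback never dereferences an undefined value, and it does not create a bar key in the structure as a side effect.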

+4
Oct 22 '15 at 16:23


