How can I make Perl's XML::Simple ignore HTML embedded in XML?

Question

I have an XML file that I fetch from the internet and parse. One element in the XML is "content", whose value contains HTML. I am using XML::Simple's XMLin to parse the file:

$xml= eval { $data->XMLin($xmldata, forcearray => 1, suppressempty=> +'') };

When I dump the resulting hash with Data::Dumper, I find that XML::Simple has parsed the HTML into the hash tree:

'content' => {
      'div' => [
                 {
                   'xmlns' => 'http://www.w3.org/1999/xhtml',
                   'p' => [
                       {
                         'a' => [
                             {
                                'href' => 'http://miamiherald.typepad.com/.a/6a00d83451b26169e20133ec6f4491970b-pi',
                               'style' => 'FLOAT: left',
                               'img' => [
                                   etc .....

This is not what I want. I want to just grab the content inside this post. How should I do it?

+3

xml perl parsing

Miriam P. Raphael Apr 14 '10 at 20:05

4 answers

#!/usr/bin/perl

use strict; use warnings;

use XML::LibXML::Reader;
my $reader = XML::LibXML::Reader->new(IO => \*DATA)
    or die "Cannot read XML\n";

if ( $reader->nextElement('content') ) {
    print $reader->readInnerXml;
}

__DATA__
<content>
<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img
src="tada"/></a></p>
</div>
</content>

Output:

<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img src="tada"/></a></p>
</div>

+3

Sinan Ünür Apr 15 '10 at 10:29

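If the HTML is embedded directly in the XML (rather than escaped or wrapped in CDATA), then as far as I know XML::Simple has no way to tell it apart from the rest of the document, so it parses it like everything else.

You could, however, rebuild the HTML by passing that part of the hash to XML::Simple's XMLout().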
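
A minimal sketch of that round trip, assuming a hypothetical `<entry>` wrapper around the `content` element and the same `ForceArray` setting as in the question. Note that XML::Simple does not preserve element order or mixed text/element content, so the rebuilt markup may differ from the original; treat this as an illustration rather than a faithful reconstruction.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(XMLin XMLout);

# Hypothetical sample document; the <entry> wrapper is an assumption.
my $xmldata = <<'XML';
<entry>
<content>
<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img src="tada"/></a></p>
</div>
</content>
</entry>
XML

my $xml = XMLin($xmldata, ForceArray => 1, SuppressEmpty => '');

# With ForceArray => 1 every nested element is an array ref,
# so the parsed <div> lives at content[0]{div}[0].
my $div = $xml->{content}[0]{div}[0];

# Serialize that subtree back to markup, naming its root element 'div'.
my $html = XMLout($div, RootName => 'div');
print $html;
```

This only works well when the embedded HTML is simple. A paragraph that mixes text with inline tags will not survive the trip through the hash intact.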

+2

marnanel Apr 14 '10 at 20:34

If the HTML is not in a CDATA section, the only workaround I know of is a hack.

Before processing with XML::Simple, find the contents of the tag that holds the suspect HTML (say, <my_html>) and run it through an HTML entity encoder ("<" => "&lt;", etc.), such as HTML::Entities. Then put the encoded content back in place of the tag's original content.

This is VERY hacky, VERY easy to get wrong unless you know 100% what you are doing with regular expressions, and really should not be done.

Having said that, it will solve your problem.
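
A rough sketch of that hack, assuming the HTML lives in a `content` element (as in the question) that is never nested and never appears inside the HTML itself. The `<entry>` wrapper and sample text are made up for illustration. Only the characters XML treats specially are encoded, because a full HTML encoding could produce named entities that XML does not define.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use XML::Simple qw(XMLin);

# Hypothetical sample document; the <entry> wrapper is an assumption.
my $xmldata = <<'XML';
<entry>
<content><div xmlns="http://www.w3.org/1999/xhtml"><p>Hello &amp; <b>welcome</b></p></div></content>
</entry>
XML

# Escape the body of <content> so the XML parser sees plain text.
# The non-greedy match breaks if <content> is nested or appears in the HTML.
$xmldata =~ s{(<content\b[^>]*>)(.*?)(</content>)}
             {$1 . encode_entities($2, '<>&"') . $3}gse;

my $xml = XMLin($xmldata, ForceArray => 1, SuppressEmpty => '');

# The parser decodes the entities again, leaving the original markup
# as a single string instead of a nested hash.
my $html = $xml->{content}[0];
print "$html\n";
```

Because the substitution runs on the raw text before any XML parsing, it is fragile by design, which is the caveat the answer already raises.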

0

DVK Apr 14 '10 at 20:38

brian d foy · Accepted Answer · 2010-04-16T04:19:00+0000

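As a general rule, once XML::Simple stops doing what you need, it is time to move on to another XML library. XML::Simple is meant for simple cases that you do not have to think about. Once you have an unusual case that needs special handling, whatever you do tends to end up kludgey with XML::Simple.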
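
As one illustration of what another library looks like here, the following sketch uses XML::LibXML's DOM interface to pull out the raw markup inside `content`. It assumes a hypothetical `<entry>` wrapper and a `content` element that is in no namespace. In a real Atom feed the element would be namespaced, and the XPath would need XML::LibXML::XPathContext with a registered prefix.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document; the <entry> wrapper is an assumption.
my $xmldata = <<'XML';
<entry>
<content>
<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img src="tada"/></a></p>
</div>
</content>
</entry>
XML

my $doc = XML::LibXML->load_xml(string => $xmldata);

# Take the first <content> element anywhere in the document.
my ($content) = $doc->findnodes('//content');
die "No <content> element found\n" unless $content;

# Serialize each child node and join them to get the inner markup.
my $html = join '', map { $_->toString } $content->childNodes;
print $html;
```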
