Should I use HTML::Parser or XML::Parser to extract and replace text?

I want to extract all of the plain text from an HTML/XHTML document, parse and possibly change it, and then write it back if necessary. Can I do this with HTML::Parser, or should I use XML::Parser?

Are there any good examples people know of?

4 answers

You should also look at Web::Scraper.
I find this module easier to use than HTML::Parser, and it helps if you are familiar with XPath.
Parsing HTML can be very unreliable depending on how clean the actual pages are - HTML is often treated as a display format, like PDF, rather than as data.
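For example, here is a minimal Web::Scraper sketch; the URL and the XPath expression are made up for illustration, so adjust both to the page you actually want to scrape:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use URI;
use Web::Scraper;

# Collect the text of every <h3><a> heading on a page via XPath.
my $headings = scraper {
    process '//h3/a', 'titles[]' => 'TEXT';
};

my $result = $headings->scrape( URI->new('http://example.com/page.html') );
print "$_\n" for @{ $result->{titles} || [] };
```

The `process` rule pairs an XPath (or CSS) selector with a key in the result hash; `'titles[]'` collects every match into an array reference.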


The HTML::Parser approach is token- and callback-based. I find it most convenient when there are particularly tricky conditions on the context in which the data you want to extract or modify occurs.

Otherwise, I prefer a tree-based approach. HTML::TreeBuilder::XPath (itself built on HTML::Parser) lets you find nodes with XPath and returns HTML::Element objects. The documentation is a bit sparse (well, spread across a couple of modules), but it is still a quick way to get results, and you can dump the tree back out as HTML.
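A minimal sketch of that tree-based style; the file name and XPath expression here are placeholders:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file('page.html');    # or ->parse($html_string)

# findnodes returns HTML::Element objects
for my $node ( $tree->findnodes('//h3/a') ) {
    print $node->as_text, "\n";
}

print $tree->as_HTML;    # serialize the (possibly modified) tree back to HTML
$tree->delete;           # HTML::TreeBuilder trees must be freed explicitly
```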

If you are dealing with well-formed XML, XML::Twig is an outstanding parser: very good memory management, and it lets you mix tree and stream processing. The documentation is very good, too.
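A sketch of XML::Twig's mixed tree/stream style, assuming a hypothetical document made of repeated `<record>` elements, each with a `<name>` child:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# Handle each <record> as soon as it is fully parsed, then purge it,
# so memory stays flat even for huge documents.
my $twig = XML::Twig->new(
    twig_handlers => {
        record => sub {
            my ( $t, $elt ) = @_;
            print $elt->first_child_text('name'), "\n";
            $t->purge;    # discard the already-processed part of the tree
        },
    },
);
$twig->parsefile('data.xml');
```

Inside the handler you get a full subtree to query and modify; `purge` is what gives you stream-like memory behavior.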


Say you want to replace all instances of PERL with Perl on someone's StackOverflow user page. You can do it with:

 #! /usr/bin/perl

 use warnings;
 use strict;

 use HTML::Parser;
 use LWP::Simple;

 my $html = get "http://stackoverflow.com/users/201469/phil-jackson";
 die "$0: get failed" unless defined $html;

 sub replace_text {
   my($skipped,$markup) = @_;
   $skipped =~ s/\bPERL\b/Perl/g;
   print $skipped, $markup;
 }

 my $p = HTML::Parser->new(
   api_version     => 3,
   marked_sections => 1,
   case_sensitive  => 1,
   unbroken_text   => 1,
   xml_mode        => 1,
   start_h => [ \&replace_text => "skipped_text, text" ],
   end_h   => [ \&replace_text => "skipped_text, text" ],
 );

 # your page may use a different encoding
 binmode STDOUT, ":utf8" or die "$0: binmode: $!";

 $p->parse($html);

The result is what we expect:

 $ wget -O phil-jackson.html http://stackoverflow.com/users/201469
 $ ./replace-text > out.html
 $ diff -ub phil-jackson.html out.html
 --- phil-jackson.html
 +++ out.html
 @@ -327,7 +327,7 @@

  Perl:

 -# $linkTrue = &hellip; ">comparing PERL md5() and PHP md5()</a></h3>
 +# $linkTrue = &hellip; ">comparing Perl md5() and PHP md5()</a></h3>

          <div class="tags t-php t-perl t-md5">
              <a href="/questions/tagged/php" class="post-tag" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/perl" class="post-tag" title="show questions tagged 'perl'" rel="tag">perl</a> <a href="/questions/tagged/md5" class="post-tag" title="show questions tagged 'md5'" rel="tag">md5</a>

"PERL:" the sore finger is part of the attribute of the element, not the text section.


Which module you should use depends on what you are trying to do. For starters, the HTML::Parser distribution ships with good examples, including a script that extracts the plain text from an HTML document.

Don't try to parse HTML documents with an XML parser: you will find yourself in a world of pain, because many valid HTML constructs are not well-formed XML.
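A quick way to see this for yourself: feed a perfectly ordinary HTML fragment to XML::Parser and watch it reject the unclosed `<br>` (XML::Parser dies on well-formedness errors, so the call is wrapped in `eval`):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Parser;

# Valid HTML, but not well-formed XML: <br> has no closing tag.
my $html = '<p>line one<br>line two</p>';

my $parser = XML::Parser->new;
eval { $parser->parse($html) };
print "XML::Parser rejected it: $@" if $@;
```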

Don't try to parse XML documents with an HTML parser either: you lose all the benefits of XML's stricter requirement that a document be well-formed before it can be parsed.


Source: https://habr.com/ru/post/1300593/

