Should I use HTML::Parser or XML::Parser to extract and replace text?

I want to extract all of the plain text from an HTML/XHTML document, parse and possibly change it, and then write it back if necessary. Can I do this with HTML::Parser, or should I use XML::Parser?

Are there any good examples people know of?

4 answers

You should also look at Web::Scraper.
I find this module easier to use than HTML::Parser, and it helps if you are familiar with XPath.
Parsing HTML can be very unreliable depending on how clean the actual pages are - HTML is often treated as a display format, like PDF, rather than as data.
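For example, here is a minimal Web::Scraper sketch; the URL and the XPath expression are made up for illustration, so adjust both to the page you actually want to scrape:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use URI;
use Web::Scraper;

# Collect the text of every <h3><a> heading on a page via XPath.
my $headings = scraper {
    process '//h3/a', 'titles[]' => 'TEXT';
};

my $result = $headings->scrape( URI->new('http://example.com/page.html') );
print "$_\n" for @{ $result->{titles} || [] };
```

The `process` rule pairs an XPath (or CSS) selector with a key in the result hash; `'titles[]'` collects every match into an array reference.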


The HTML::Parser approach is token- and callback-based. I find it most convenient when there are particularly tricky conditions on the context in which the data you want to extract or modify occurs.

Otherwise, I prefer a tree-based approach. HTML::TreeBuilder::XPath (itself built on HTML::Parser) lets you find nodes with XPath and returns HTML::Element objects. The documentation is a bit sparse (well, spread across a couple of modules), but it is still a quick way to get results, and you can dump the tree back out as HTML.
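A minimal sketch of that tree-based style; the file name and XPath expression here are placeholders:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file('page.html');    # or ->parse($html_string)

# findnodes returns HTML::Element objects
for my $node ( $tree->findnodes('//h3/a') ) {
    print $node->as_text, "\n";
}

print $tree->as_HTML;    # serialize the (possibly modified) tree back to HTML
$tree->delete;           # HTML::TreeBuilder trees must be freed explicitly
```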

If you are dealing with well-formed XML, XML::Twig is an outstanding parser: very good memory management, and it lets you mix tree and stream processing. The documentation is very good, too.
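A sketch of XML::Twig's mixed tree/stream style, assuming a hypothetical document made of repeated `<record>` elements, each with a `<name>` child:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# Handle each <record> as soon as it is fully parsed, then purge it,
# so memory stays flat even for huge documents.
my $twig = XML::Twig->new(
    twig_handlers => {
        record => sub {
            my ( $t, $elt ) = @_;
            print $elt->first_child_text('name'), "\n";
            $t->purge;    # discard the already-processed part of the tree
        },
    },
);
$twig->parsefile('data.xml');
```

Inside the handler you get a full subtree to query and modify; `purge` is what gives you stream-like memory behavior.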


Say you want to replace all instances of PERL with Perl on someone's StackOverflow user page. You can do it with:

 #! /usr/bin/perl

 use warnings;
 use strict;

 use HTML::Parser;
 use LWP::Simple;

 my $html = get "http://stackoverflow.com/users/201469/phil-jackson";
 die "$0: get failed" unless defined $html;

 sub replace_text {
   my($skipped,$markup) = @_;
   $skipped =~ s/\bPERL\b/Perl/g;
   print $skipped, $markup;
 }

 my $p = HTML::Parser->new(
   api_version     => 3,
   marked_sections => 1,
   case_sensitive  => 1,
   unbroken_text   => 1,
   xml_mode        => 1,
   start_h => [ \&replace_text => "skipped_text, text" ],
   end_h   => [ \&replace_text => "skipped_text, text" ],
 );

 # your page may use a different encoding
 binmode STDOUT, ":utf8" or die "$0: binmode: $!";

 $p->parse($html);

The result is what we expect:

 $ wget -O phil-jackson.html http://stackoverflow.com/users/201469
 $ ./replace-text > out.html
 $ diff -ub phil-jackson.html out.html
 --- phil-jackson.html
 +++ out.html
 @@ -327,7 +327,7 @@

  Perl:

 -# $linkTrue = &hellip; ">comparing PERL md5() and PHP md5()</a></h3>
 +# $linkTrue = &hellip; ">comparing Perl md5() and PHP md5()</a></h3>

          <div class="tags t-php t-perl t-md5">
              <a href="/questions/tagged/php" class="post-tag" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/perl" class="post-tag" title="show questions tagged 'perl'" rel="tag">perl</a> <a href="/questions/tagged/md5" class="post-tag" title="show questions tagged 'md5'" rel="tag">md5</a>

"PERL:" the sore finger is part of the attribute of the element, not the text section.


Which module you should use depends on what you are trying to do. For starters, the HTML::Parser distribution ships with good examples, including a script that extracts the plain text from an HTML document.

Don't try to parse HTML documents with an XML parser: you will find yourself in a world of pain, because many valid HTML constructs are not well-formed XML.
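A quick way to see this for yourself: feed a perfectly ordinary HTML fragment to XML::Parser and watch it reject the unclosed `<br>` (XML::Parser dies on well-formedness errors, so the call is wrapped in `eval`):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Parser;

# Valid HTML, but not well-formed XML: <br> has no closing tag.
my $html = '<p>line one<br>line two</p>';

my $parser = XML::Parser->new;
eval { $parser->parse($html) };
print "XML::Parser rejected it: $@" if $@;
```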

Don't try to parse XML documents with an HTML parser either: you lose all the benefits of XML's stricter requirement that a document be well-formed before it can be parsed.


Source: https://habr.com/ru/post/1300593/

