How to parse invalid HTML with Perl?

I maintain a database of articles stored as HTML. Unfortunately, the editors who wrote the articles did not know proper HTML, so they often wrote things like:

<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div> 

I tried using HTML::TreeBuilder to parse this HTML, but after parsing it and dumping the resulting tree, everything between <div class="highlight"> and </div> had disappeared. All I had left was <div class="highlight"></div>.

The editors also often did things like:

 <div class="article"><style>@font-face { font-family: "Cambria"; }</style>Article starts here</div> 

Parsing this with HTML::TreeBuilder again results in an empty <div class="article"></div>.

Any ideas on how to parse this broken HTML? Is it even possible?

+6
4 answers

First, I would run it through HTML::Tidy:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use HTML::Tidy;

    my $html = <<EO_HTML;
    <div class="highlight"><html><head></head>
    <body><p>Note that ...</p></html>
    </div>
    EO_HTML

    my $tidy = HTML::Tidy->new;
    print $tidy->clean( $html );

Output:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
    <html>
    <head>
    <meta name="generator" content="tidyp for Windows (v1.04), see www.w3.org">
    <title></title>
    </head>
    <body>
    <div class="highlight">
    <p>Note that ...</p>
    </div>
    </body>
    </html>

You can control the output by setting various configuration parameters.

Then load the cleaned HTML via the parser.
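The two steps can be chained in one script. What follows is only a minimal sketch: it assumes HTML::Tidy and HTML::TreeBuilder are both installed, passes `tidy_mark => 0` as one example of a configuration parameter (it suppresses the generator meta tag), and looks up the `highlight` div from the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Tidy;
use HTML::TreeBuilder;

# Broken input taken from the question.
my $html = '<div class="highlight"><html><head></head>'
         . '<body><p>Note that ...</p></html></div>';

# Step 1: let tidy repair the markup. tidy_mark => 0 is just one
# example of a configuration parameter.
my $tidy  = HTML::Tidy->new( { tidy_mark => 0 } );
my $clean = $tidy->clean($html);

# Step 2: build a tree from the repaired markup.
my $tree = HTML::TreeBuilder->new_from_content($clean);

# Find the div that used to come out empty and print it with its content.
my $div = $tree->look_down( _tag => 'div', class => 'highlight' );
print $div->as_HTML( undef, '  ', {} ), "\n" if $div;
```

Call `$tree->delete` once you are done with the tree if the script keeps running and processes many articles.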

Otherwise, you can try to build the tree step by step using HTML::TokeParser::Simple or even just HTML::Parser, but I think that way lies madness.

Keep in mind that a parser that tries to build a tree representation is going to be stricter than a stream parser that simply recognizes the various elements as it sees them.
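To illustrate the stream approach, here is a small sketch using HTML::TokeParser::Simple on the second snippet from the question. It assumes that dropping misplaced `<style>` blocks is what you want; every other token is copied through unchanged and no tree is built.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Broken input taken from the question.
my $html = '<div class="article"><style>@font-face { font-family: "Cambria"; }'
         . '</style>Article starts here</div>';

my $parser   = HTML::TokeParser::Simple->new( string => $html );
my $in_style = 0;     # true while we are between <style> and </style>
my $out      = '';

while ( my $token = $parser->get_token ) {
    if ( $token->is_start_tag('style') ) { $in_style = 1; next; }
    if ( $token->is_end_tag('style') )   { $in_style = 0; next; }
    next if $in_style;        # skip the CSS text inside the style block
    $out .= $token->as_is;    # copy every other token through verbatim
}

print "$out\n";
```

Because each token is handled in isolation, the stray `<style>` never gets a chance to confuse a tree builder; the price is that nesting has to be tracked by hand, as the `$in_style` flag does here.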

+11

You can try Marpa::HTML, a high-level HTML parser that allows extremely liberal parsing. It can parse even invalid HTML using a technique its author calls "Ruby Slippers": Marpa::HTML adds the elements that should be there.

See the blog post "How to parse HTML" by Jeffrey Kegler, author of the Marpa parser and Marpa::HTML, for examples of reformatting HTML and of accepting invalid HTML.

+3

XML::LibXML is also, surprisingly, well suited to this kind of cleanup when used properly. It is also very fast, and deep and flexible once you get past the learning curve.

    #!/usr/bin/env perl
    use strictures;
    use XML::LibXML;

    my @craptastic = (
        '<div class="article"><style>@font-face{ font-family: "Cambria" }</style>Article starts here</div>',
        '<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>'
    );

    # The inline setting of recover_silently is broken/non-functional so
    # we do the method calls to set.
    my $parser = XML::LibXML->new();
    $parser->recover_silently(1);
    $parser->keep_blanks(1);

    for my $crap ( @craptastic ) {
        my $doc = $parser->load_html( string => $crap );

        # Optional example for killing style tags not in the <head/>
        $_->parentNode->removeChild($_) for $doc->findnodes("//body//style");

        print $/, $crap, $/;
        my ( $body ) = $doc->findnodes("//body");
        print "-" x 60, $/;
        print $_->serialize(1) for $body->childNodes;
        print $/, $/;
    }

Gives you:

    <div class="article"><style>@font-face{ font-family: "Cambria" }</style>Article starts here</div>
    ------------------------------------------------------------
    <div class="article">Article starts here</div>

    <div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>
    ------------------------------------------------------------
    <div class="highlight">
      <p>Note that ...</p>
    </div>
+1

Sounds like tag soup. As another approach, you can also call the "html-tagsoup" Java program from your Perl program (for example, with backticks). It can be invoked as a standalone program like this:

    java -jar tagsoup-1.2.1.jar [option ...] [file ...]
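Calling it from Perl might look roughly like the sketch below. Several things here are assumptions rather than facts from the answer: the jar location (read from a hypothetical `TAGSOUP_JAR` environment variable, falling back to the current directory), that `java` is on the `PATH`, and that TagSoup writes its result to standard output when given a file name. The helper name `tagsoup_clean` is made up for this example. A list-form pipe open is used instead of backticks so that no shell quoting is needed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption: adjust this path to wherever the TagSoup jar lives.
my $jar = $ENV{TAGSOUP_JAR} || 'tagsoup-1.2.1.jar';

# Hypothetical helper: write the HTML to a temp file, run TagSoup on it,
# and return whatever the program prints.
sub tagsoup_clean {
    my ($html) = @_;

    my ( $fh, $filename ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
    print {$fh} $html;
    close $fh or die "Cannot close $filename: $!";

    # Same idea as `java -jar ... file` in backticks, without a shell.
    open my $out, '-|', 'java', '-jar', $jar, $filename
        or die "Cannot run java: $!";
    my $clean = do { local $/; <$out> };
    close $out or die "TagSoup exited with status $?";

    return $clean;
}

if ( -e $jar ) {
    print tagsoup_clean(
        '<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>'
    );
}
else {
    warn "TagSoup jar not found at '$jar'\n";
}
```

Starting a JVM for every article is slow, so for a whole database it is probably worth passing several files per invocation or using one of the pure-Perl options from the other answers.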

HTML::Tidy used to be better or more flexible, I think.

-1

Source: https://habr.com/ru/post/919705/

