Remove html code between two comments using perl

Question

Let's say I have HTML, read from a file and stored as a single line in Perl, that looks like this:

<tbody> <tr> <td width="650"> <!--MyComment--> <a href="http://myurl.com"><img src="myimage.png" > </a> <!--MyComment--> </td> </tr> </tbody> ... ... ...

What would be the best way to remove the HTML between the two comments? I was thinking about using the HTML::Tree Perl module.

perl

user2429569 Jun 23 '13 at 9:00

2 answers

Birei · Answer 1 · 2013-06-23T11:39:12+0000

One option is to use pull parsing. Here is an example with HTML::TokeParser. It uses two loops. The outer loop prints every token it reads, up to and including the first occurrence of your comment. The inner loop then consumes tokens without printing them until the second comment appears, prints that comment, and hands control back to the outer loop.

The contents of script.pl:

    #!/usr/bin/env perl

    use warnings;
    use strict;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new( shift );

    while ( my $token = $p->get_token ) {
        printf qq|%s|, $token->[0] =~ m/S|E|PI/ ? $token->[ $#$token ] : $token->[1];

        if ( $token->[0] eq q|C| && $token->[1] =~ m/(?i)MyComment/ ) {
            ## Here begins the comment.
            while ( my $token2 = $p->get_token ) {
                if ( $token2->[0] eq q|C| && $token2->[1] =~ m/(?i)MyComment/ ) {
                    ## Here ends the comment.
                    printf qq|%s|, $token2->[1];
                    last;
                }
            }
        }
    }

Run it like this:

 perl script.pl htmlfile

This gives:

 <html> <head> <title>Title</title> </head> <body> <tbody> <tr> <td width="650"> <!--MyComment--><!--MyComment--> </td> </tr> </tbody> </body> </html>
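Since the question describes the HTML as already held in a single string, a plain substitution may be enough for simple cases. The following is only a sketch of that alternative technique (a regex, not the pull parser used above), and it assumes the marker comments are not nested and do not appear inside attribute values or scripts. The helper name `strip_between_comments` is hypothetical and introduced only for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): remove everything
# between each pair of marker comments, keeping the comments themselves.
sub strip_between_comments {
    my ( $html, $marker ) = @_;

    # Match the marker comment case-insensitively, allowing optional
    # whitespace around the marker text. \Q...\E quotes any regex
    # metacharacters in the marker.
    my $comment = qr/<!--\s*\Q$marker\E\s*-->/i;

    # Non-greedy ".*?" pairs each opening comment with the next closing one;
    # /s lets "." cross newlines in case the input is not a single line.
    $html =~ s/($comment).*?($comment)/$1$2/gs;

    return $html;
}

my $html = '<td width="650"> <!--MyComment--> <a href="http://myurl.com">'
         . '<img src="myimage.png" > </a> <!--MyComment--> </td>';

print strip_between_comments( $html, 'MyComment' ), "\n";
```

For anything beyond this kind of tightly controlled input, the parser-based approach is the safer choice, because a regex has no notion of HTML structure.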

oalders · Answer 2 · 2013-06-24T04:33:56+0000

You can also do this with HTML::Restrict, which strips comments by default. The caveat is that with HTML::Restrict you need to explicitly specify all the HTML elements and attributes that you want to keep. If you just want to delete comments, this is probably not the right module for you, but if there are other elements that need to be removed while you are at it, it might be worth a look.
