Removing everything except html tags using perl

Question

I am looking for a way to remove everything from an HTML document, leaving ONLY the HTML tags. Does anyone know of a method for this? I have experience with many Perl modules and have searched this site carefully.

I want to pass html as a string to my perl script and delete everything except the tags. Here is an example:

Input:

    <!doctype html>
    <html>
    <head>
        <title>Example Domain</title>
        <meta charset="utf-8" />
        <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1" />
        <style type="text/css">
        body {
            background-color: #f0f0f2;
            margin: 0;
            padding: 0;
            font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        }
        div {
            width: 600px;
            margin: 5em auto;
            padding: 50px;
            background-color: #fff;
            border-radius: 1em;
        }
        a:link, a:visited {
            color: #38488f;
            text-decoration: none;
        }
        @media (max-width: 700px) {
            body {
                background-color: #fff;
            }
            div {
                width: auto;
                margin: 0 auto;
                border-radius: 0;
                padding: 1em;
            }
        }
        </style>
    </head>
    <body>
    <div>
        website content ....
    </div>
    </body>
    </html>

becomes:

 <html><head><title></title><meta><meta><meta><style></style></head><body><div><h1></h1> <p></p><p><a></a></p></div></body></html>

html regex perl

user2421267 May 26 '13 at 12:08

3 answers

optional · Answer 1 · 2013-05-26T02:24:42+0000

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use XML::Twig;

    Main( @ARGV );
    exit( 0 );

    sub Main {
        if( @_ ){
            nothing_but_tags("$_") for @_;
        } else {
            nothing_but_tags(q{<NoTe KunG="FoO" ChOp="SuEy">
    NoteKungFo0Ch0pSuEy
    <To KunG="FoO">ToKungFo0
    <Person KunG="FoO">Satan</Person>
    </To>
    <Beef KunG="FoO">
    BeefKunGFoO
    <SaUsAGe KunG="FoO">is Tasty
    </SaUsAGe>
    </Beef>
    </NoTe>},
            );
        }
    }

    sub nothing_but_tags {
        my( $input, %opt ) = @_;
        $opt{pretty_print} ||= 'indented' ;
        my $t = XML::Twig->new(
            %opt,
            force_end_tag_handlers_usage => 1,
            start_tag_handlers => {
                _all_ => sub {
                    if( $_->has_atts ){
                        $_->set_atts ({});
                    }
                    return;
                },
            },
            end_tag_handlers => {
                _all_ => sub { $_->flush; return },
            },
            char_handler => sub { '' },
        );
        $t->xparse( $_[0] );
        $t->flush();
        ();
    }
    __END__
    <NoTe>
      <To>
        <Person></Person>
      </To>
      <Beef>
        <SaUsAGe></SaUsAGe>
      </Beef>
    </NoTe>

nwellnhof · Answer 2 · 2013-06-03T13:15:42+0000

This conversion is very simple with XSLT, so here is an example using XML::LibXSLT.

    #!/usr/bin/perl
    use strict;
    use XML::LibXML;
    use XML::LibXSLT;

    my $filename = $ARGV[0] or die("Usage: $0 filename\n");
    my $doc = XML::LibXML->load_html(location => $filename);

    my $stylesheet_doc = XML::LibXML->load_xml(string => <<'EOF');
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:template match="*">
            <xsl:copy>
                <xsl:apply-templates select="*"/>
            </xsl:copy>
        </xsl:template>
    </xsl:stylesheet>
    EOF

    my $xslt = XML::LibXSLT->new;
    my $stylesheet = $xslt->parse_stylesheet($stylesheet_doc);
    my $result = $stylesheet->transform($doc);
    print $result->serialize_html;

Albi patozi · Answer 3 · 2013-06-03T13:20:00+0000

I don't know if I understood your question correctly, but to keep JUST THE TAGS you could take the output of a tag-stripping function (one that removes only the tags), and then replace that output with an empty string in the source text. In theory, the first step gives you exactly the text that lies outside the tags, and the second step removes that text, leaving only the tags.
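This answer gives no code, so here is a minimal sketch of the same idea done in a single pass: instead of stripping the tags first and then deleting the leftover text, it keeps only what looks like a tag. The helper name tags_only is hypothetical, and the regex is an assumption for illustration only. It will misfire on tags inside comments or scripts and on attribute values that contain ">", so a real parser such as the ones in the other answers is safer for arbitrary HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# tags_only() is a hypothetical helper, not part of any module.
# It returns the opening and closing tags of $html with attributes,
# text, the doctype, and comments dropped.
sub tags_only {
    my ($html) = @_;
    my @tags;

    # Match "<name ...>" or "</name ...>": an optional slash, a tag name,
    # then anything up to the next ">". "<!doctype ...>" and "<!-- ... -->"
    # do not match because "!" is not a valid first character of a name.
    while ( $html =~ m{<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>}g ) {
        push @tags, "<$1$2>";    # rebuild the tag without its attributes
    }
    return join '', @tags;
}

print tags_only('<div class="a">Hello <b>world</b></div>'), "\n";
# prints: <div><b></b></div>
```

Self-closing tags such as `<meta charset="utf-8" />` come out as `<meta>`, which matches the shape of the output the question asks for.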
