How to easily parse between <div class = "foo"> and </div> in Perl

Question


I want to parse a web page into a Perl data structure. First I load the page with

    use LWP::Simple;
    my $html = get("http://f.oo");

Now I know two ways to handle this. The first is regular expressions and the second is using modules.

I started by reading about HTML::Parser and found some examples. But I'm not that sure about my Perl knowledge.

My sample code:

    my @links;
    my $p = HTML::Parser->new();
    $p->handler(start => \&start_handler, "tagname,attr,self");
    $p->parse($html);
    foreach my $link (@links) {
        print "Linktext: ", $link->[1], "\tURL: ", $link->[0], "\n";
    }

    sub start_handler {
        return if (shift ne 'a');
        my ($class) = shift->{href};
        my $self = shift;
        my $text;
        $self->handler(text => sub { $text = shift; }, "dtext");
        $self->handler(end => sub { push(@links, [$class, $text]) if (shift eq 'a') }, "tagname");
    }

I don't understand why there are two shifts. The second should be the self reference. But the first makes me think that the self reference has already been shifted, used as a hash, and the value for href stored in $class. Can someone explain this line (my ($class) = shift->{href};)?

Besides this gap in my understanding, I don't want to parse all the URLs. I want to put all the code between <div class="foo"> and </div> into a string. There is a lot of code in between, especially other <div></div> tags, so I or the module must find the right closing tag. After that, I plan to scan the string again to find specific elements and classes such as <h1>, <h2>, <p class="foo2"></p>, etc.

I hope this information helps you give me some useful tips. Please keep in mind that, above all, I want a solution that is easy to understand; it does not need to perform well at this stage.


html perl parsing

froehli Dec 19 '11 at 23:03


4 answers

Sinan Ünür · Answer 1 · 2011-12-19T23:32:01+0000

Use HTML::TokeParser::Simple.

Untested code based on your description:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use HTML::TokeParser::Simple;

    my $p = HTML::TokeParser::Simple->new(url => 'http://example.com/example.html');

    my $level;

    while (my $tag = $p->get_tag('div')) {
        my $class = $tag->get_attr('class');
        next unless defined($class) and $class eq 'foo';

        $level += 1;

        while (my $token = $p->get_token) {
            $level += 1 if $token->is_start_tag('div');
            $level -= 1 if $token->is_end_tag('div');
            print $token->as_is;
            unless ($level) {
                last;
            }
        }
    }

ikegami · Answer 2 · 2011-12-19T23:36:59+0000

HTML::Parser is more of a tokenizer than a parser, which leaves you with a lot of the hard work. Have you considered using HTML::TreeBuilder (which uses HTML::Parser) or XML::LibXML (a great library that supports HTML)?
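As a rough illustration of the tree-based approach, here is a minimal sketch using HTML::TreeBuilder. The sample HTML and all variable names are made up for the example; in the question, the input would be the string returned by get(). Because the tree already knows where each element ends, nested <div> tags need no manual depth counting.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample page; in the question this would be the string from get().
my $html = <<'HTML';
<html><body>
<div class="foo"><h1>Title</h1><div>nested</div><p class="foo2">text</p></div>
<div class="bar">ignored</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down returns every element matching all criteria; the nested <div>
# has no class attribute, so only the outer one matches.
my @divs = $tree->look_down(_tag => 'div', class => 'foo');

# Children are either HTML::Element objects or plain text strings
# (the plain strings are not re-escaped in this sketch).
my $inner = join '', map { ref $_ ? $_->as_HTML : $_ } $divs[0]->content_list;

# The second pass the question describes can run on the element directly.
my @headings = map { $_->as_text } $divs[0]->look_down(_tag => 'h1');
my @foo2     = map { $_->as_text } $divs[0]->look_down(_tag => 'p', class => 'foo2');

print "$inner\n";
$tree->delete;    # free the tree explicitly
```

Note that class => 'foo' compares the whole attribute value, so an element with class="foo other" would not match; a regex or a code reference can be passed as the criterion value in that case.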

tempire · Answer 3 · 2011-12-25T02:40:27+0000

No need to get so complicated. You can retrieve a page and find elements in its DOM using CSS selectors with Mojo::UserAgent:

 say Mojo::UserAgent->new->get('http://f.oo')->res->dom->find('div.foo');

or by looping over the found elements:

    say $_ for Mojo::UserAgent->new->get('http://f.oo')->res->dom->find('div.foo')->each;

or with a loop using a callback:

    Mojo::UserAgent->new->get('http://f.oo')->res->dom->find('div.foo')->each(sub {
        # Mojo::Collection passes the element first, then its 1-based position
        my ($el, $count) = @_;
        say "$count: $el";
    });

Amadan · Answer 4 · 2011-12-19T23:11:37+0000

According to the docs, the handler's signature is (\%attr, \@attr_seq, $text). There are three shifts, one for each argument.

 my ($class) = shift->{href};

is equivalent to:

    my $class;
    my %attr_seq;
    my $attr_seq_ref;
    $attr_seq_ref = shift;
    %attr_seq = %$attr_seq_ref;
    $class = $attr_seq{'href'};
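To see what each shift removes, here is a small core-Perl sketch that calls a stand-in handler by hand. It assumes the argspec "tagname,attr,self" from the question's code, under which HTML::Parser passes the tag name, the attribute hash reference, and the parser object, in that order. The names demo_start_handler and $fake_parser are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the question's start_handler; it returns what
# each shift removed from @_ instead of registering further handlers.
sub demo_start_handler {
    my $tagname = shift;          # 1st shift: the tag name, e.g. 'a'
    return if $tagname ne 'a';
    my $href    = shift->{href};  # 2nd shift: attribute hash ref, dereferenced at once
    my $self    = shift;          # 3rd shift: the parser object
    return ($tagname, $href, $self);
}

# Simulate the call HTML::Parser would make for <a href="http://f.oo/bar">.
my $fake_parser = { name => 'stand-in for the HTML::Parser object' };
my @seen = demo_start_handler('a', { href => 'http://f.oo/bar' }, $fake_parser);

print "tag=$seen[0] href=$seen[1]\n";    # prints "tag=a href=http://f.oo/bar"
```

So shift->{href} does not touch the self reference: the first shift has already consumed the tag name, which leaves the attribute hash reference at the front of @_.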
