Perl WWW::Mechanize web spider: how to find all the links

Question

I am currently trying to create a Perl web spider using WWW::Mechanize.

I am trying to create a web spider that will crawl an entire website, starting from a URL entered by the user, and extract all the links from every page on the site.

What I have so far:

    use strict;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    my $urlToSpider = $ARGV[0];
    $mech->get($urlToSpider);

    print "\nThe url that will be spidered is $urlToSpider\n";
    print "\nThe links found on the url starting page\n";

    my @foundLinks = $mech->find_all_links();

    foreach my $linkList(@foundLinks) {
        unless ($linkList->[0] =~ /^http?:\/\//i || $linkList->[0] =~ /^https?:\/\//i) {
            $linkList->[0] = "$urlToSpider" . $linkList->[0];
        }
        print "$linkList->[0]";
        print "\n";
    }

What it does:

1. It currently retrieves and lists all the links on the start page.

2. If a link is in the form /contact-us or /help, it prepends "http://www.thestartingurl.com" so that the link becomes "http://www.thestartingurl.com/contact-us".

Problem:

At the moment it also finds links to external sites, which I do not want. For example, if I spider "http://www.tree.com", it will find links such as http://www.tree.com/find-us. However, it will also find links to other sites, such as http://www.hotwire.com.

How do I stop it from finding these external URLs?

After finding all the URLs on the page, I also want to save the internal links only to a new array named @internalLinks, but I cannot get that to work.

Any help is greatly appreciated, thanks in advance.

+4

perl hyperlink web-crawler mechanize

perl-user Oct 30 '12 at 22:29

1 answer

Robarl · Accepted Answer · 2012-10-31T07:27:21+0000

This should do the trick:

 my @internalLinks = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);

If you don't want CSS links (for example, stylesheet <link> tags), restrict the search to <a> tags:

 my @internalLinks = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/, tag => 'a');

In addition, the regular expression check that you use to prepend the domain to relative links can be replaced with url_abs(), which returns the absolute URL of the link:

 print $linkList->url_abs();
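To crawl the whole site rather than only the start page, the filtered find_all_links() call can be run on every internal page that is discovered. The following is a minimal sketch, not a definitive implementation: the %seen hash, the @queue array, the fragment stripping and the autocheck => 0 setting are assumptions added for illustration and are not part of the original question or answer. It treats any URL that begins with the start URL as internal.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Starting URL comes from the command line, as in the question.
my $urlToSpider = $ARGV[0] or die "Usage: $0 <start-url>\n";

# autocheck => 0 (an assumption): a failed fetch returns instead of dying,
# so one broken link does not stop the whole crawl.
my $mech = WWW::Mechanize->new( autocheck => 0 );

my %seen  = ( $urlToSpider => 1 );   # URLs already queued (hypothetical helper)
my @queue = ($urlToSpider);          # URLs still to fetch (hypothetical helper)
my @internalLinks;                   # every internal link found, without repeats

while ( defined( my $url = shift @queue ) ) {
    $mech->get($url);

    # Only parse pages that were fetched successfully and are HTML.
    next unless $mech->success && $mech->is_html;

    # Keep only <a> links whose absolute URL starts with the start URL.
    my @links = $mech->find_all_links(
        tag           => 'a',
        url_abs_regex => qr/^\Q$urlToSpider\E/,
    );

    for my $link (@links) {
        my $abs = $link->url_abs->as_string;
        $abs =~ s/#.*\z//;           # treat page#section as the same page
        next if $seen{$abs}++;       # skip URLs that are already known
        push @internalLinks, $abs;
        push @queue, $abs;
    }
}

print "$_\n" for @internalLinks;
```

Run it as, for example, perl spider.pl http://www.tree.com (the script name is hypothetical). Note that the prefix match would also accept a different host that merely begins with the same string, so passing the start URL with a trailing slash is a safer choice.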
