I am currently trying to create a Perl web spider using WWW::Mechanize.
What I am trying to do is build a spider that will crawl the entire site at the URL entered by the user and extract all of the links from every page on the site.
What I have so far:
use strict;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
my $urlToSpider = $ARGV[0];
$mech->get($urlToSpider);

print "\nThe url that will be spidered is $urlToSpider\n";
print "\nThe links found on the url starting page\n";

my @foundLinks = $mech->find_all_links();

foreach my $linkList (@foundLinks) {
    unless ($linkList->[0] =~ /^http?:\/\//i || $linkList->[0] =~ /^https?:\/\//i) {
        $linkList->[0] = "$urlToSpider" . $linkList->[0];
    }
    print "$linkList->[0]";
    print "\n";
}
What it does:
1. It currently retrieves and lists all of the links on the starting page.
2. If a link it finds is in the form /contact-us or /help, it prepends "http://www.thestartingurl.com" so that it becomes "http://www.thestartingurl.com/contact-us".
Problem:
At the moment it also finds links to external sites, which I do not want. For example, if the starting URL is "http://www.tree.com", it will find links such as http://www.tree.com/find-us, but it will also find links to other sites, such as http://www.hotwire.com.
How do I stop it from finding these external URLs?
After finding all the URLs on the page, I also want to store this new list of internal-only links in a new array called @internalLinks, but I cannot get it to work.
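For reference, here is a rough, untested sketch of the kind of filtering I have in mind. It assumes the starting URL includes the http:// scheme, and it relies on the url_abs method of WWW::Mechanize::Link and the host/scheme methods of URI, which I believe should work here but have not managed to get going myself:

use strict;
use warnings;
use WWW::Mechanize;
use URI;

my $mech        = WWW::Mechanize->new();
my $urlToSpider = $ARGV[0];
$mech->get($urlToSpider);

# Host of the starting URL, used to decide what counts as internal.
my $baseHost = URI->new($urlToSpider)->host;

my @internalLinks;
foreach my $link ($mech->find_all_links()) {
    # url_abs() resolves the link against the page's base URL.
    my $uri = $link->url_abs;

    # Skip anything that is not http/https (mailto:, javascript:, etc.).
    next unless $uri->scheme && $uri->scheme =~ /^https?$/;

    # Keep only links pointing at the same host as the starting URL.
    push @internalLinks, $uri->as_string if $uri->host eq $baseHost;
}

print "$_\n" for @internalLinks;

The idea is that comparing hosts would keep links like http://www.tree.com/find-us while dropping ones like http://www.hotwire.com.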
Any help is greatly appreciated, thanks in advance.