I want to process all links, but external from the whole website. Is there an easy way to determine if a link is external and skip it?
My code looks like this (site URL is passed via command line argument)
I am using mechanize (0.9.3) and ruby 1.8.6 (2008-08-11 patchlevel 287) [i386-mswin32]
Please note that the website can use a relative path so that there is no host / domain, and it makes it more complex.
require 'mechanize'
def process_page(page)
puts
puts page.title
STDIN.gets
page.links.each do |link|
process_page($agent.get(link.href))
end
end
$agent = WWW::Mechanize.new
$agent.user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.4) Gecko/20091016 Firefox/3.5.4'
process_page($agent.get(ARGV[0]))
source
share