Process all links except external (ruby + mechanize)

I want to process all the links on a website except the external ones. Is there an easy way to determine whether a link is external and skip it?

I am using mechanize (0.9.3) and ruby 1.8.6 (2008-08-11 patchlevel 287) [i386-mswin32].

Please note that the website can use relative paths, so a link may have no host or domain, which makes this more complex.

My code looks like this (the site URL is passed as a command-line argument):

require 'mechanize'

def process_page(page)
  puts
  puts page.title
  STDIN.gets
  page.links.each do |link|
    process_page($agent.get(link.href))
  end
end

$agent = WWW::Mechanize.new
$agent.user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.4) Gecko/20091016 Firefox/3.5.4'
process_page($agent.get(ARGV[0]))
2 answers

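Use the URI library; it is the standard way to pick URLs apart. The key method is URI.route_to(), but first note what parsing a relative URL gives you: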

require 'uri'

URI.parse('/main.rbx?page=1').host # => nil
URI.parse('main.rbx?page=1').host  # => nil

Relative URLs have no host, so on their own they cannot be compared against the site's domain.

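For absolute URLs, call route_to() on the site's base URL: the result is a relative URL (no host) when the target is on the same host, and a URL that still carries a host otherwise.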

uri = URI.parse('http://my.example.com')

uri.route_to('http://my.example.com/main.rbx?page=1').host  # => nil
uri.route_to('http://another.com/main.rbx?page=1').host # => "another.com"

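In short: parse the site's base URL with URI, call route_to() with each link's URL, and check .host on the result. A nil host means the link stays on the site.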
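The steps above can be sketched as a small helper. This is only a sketch under a few assumptions: the name external_link? is hypothetical, the base URL is given with a trailing slash, and relative hrefs are resolved with merge() first, because route_to() raises URI::BadURIError when handed a relative URL. Non-HTTP links (mailto: and the like) are simply treated as external, and so is a link to the same host under a different scheme, since route_to() returns the target unchanged in that case.

```ruby
require 'uri'

# Hypothetical helper (not part of URI or mechanize):
# returns true when href leads away from the host of base.
def external_link?(base, href)
  return true if href.nil?                    # link without an href: skip it
  target = base.merge(href)                   # resolve relative hrefs against the base
  return true unless target.is_a?(URI::HTTP)  # URI::HTTPS is a subclass of URI::HTTP
  # Same scheme/host/port: route_to yields a relative URL, so host is nil.
  !base.route_to(target).host.nil?
rescue URI::Error
  true                                        # unparsable href: treat as external
end

base = URI.parse('http://my.example.com/')
puts external_link?(base, 'main.rbx?page=1')
puts external_link?(base, 'http://another.com/main.rbx?page=1')
```

In the loop from the question this would be used as `next if external_link?(base, link.href)`, with base parsed once from ARGV[0].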


You can also check the host through the link's uri:

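  page.links.each do |link|
    uri = link.uri
    # Relative links have a nil host and count as internal;
    # links without an href (nil uri) are skipped.
    next if uri.nil? || (uri.host && uri.host !~ /\A(www\.)?thissite\.com\z/)
    process_page($agent.get(link.href))
  end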

Source: https://habr.com/ru/post/1742920/
