DRY search every page of a site with nokogiri

I want to search every page of a site. My idea is to find all the links on a page that stay within the domain, visit them, and repeat. I will also have to take steps to avoid visiting the same page twice.

So it starts off very easily:

    page = 'http://example.com'
    nf = Nokogiri::HTML(open(page))
    links = nf.xpath '//a' #find all links on current page
    main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq

"main_links" is now an array of links from the active page starting with "/" (which should be links only to the current domain).

From here I can feed those links into the same sort of code as above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I should start collecting all the visited links as I visit them:

    main_links.each do |ml|
      visited_links = [] #new array of what is visited
      np = Nokogiri::HTML(open(page + ml)) #load the first main_link
      visited_links.push(ml) #push the page we're on
      np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq #grab all links on this page pointing to the current domain
      main_links.push(np_links).compact.uniq #remove duplicates after pushing?
    end

I'm still developing this last bit ... but does this seem like the right approach?

Thanks.

+4
3 answers

Others have advised you not to write your own web crawler. I agree, if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:

"[...] but I don’t know how best to make sure that I don’t repeat myself"

Recursion is the key here. Something like the following code:

    require 'set'
    require 'uri'
    require 'nokogiri'
    require 'open-uri'

    def crawl_site( starting_at, &each_page )
      files = %w[png jpeg jpg gif svg txt js css zip gz]
      starting_uri = URI.parse(starting_at)
      seen_pages = Set.new                      # Keep track of what we've seen

      crawl_page = ->(page_uri) do              # A re-usable mini-function
        unless seen_pages.include?(page_uri)
          seen_pages << page_uri                # Record that we've seen this
          begin
            doc = Nokogiri.HTML(open(page_uri)) # Get the page
            each_page.call(doc,page_uri)        # Yield page and URI to the block

            # Find all the links on the page
            hrefs = doc.css('a[href]').map{ |a| a['href'] }

            # Make these URIs, throwing out problem ones like mailto:
            uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact

            # Pare it down to only those pages that are on the same site
            uris.select!{ |uri| uri.host == starting_uri.host }

            # Throw out links to files (this could be more efficient with regex)
            uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }

            # Remove #foo fragments so that sub-page links aren't differentiated
            uris.each{ |uri| uri.fragment = nil }

            # Recursively crawl the child URIs
            uris.each{ |uri| crawl_page.call(uri) }

          rescue OpenURI::HTTPError # Guard against 404s
            warn "Skipping invalid link #{page_uri}"
          end
        end
      end

      crawl_page.call( starting_uri )   # Kick it all off!
    end

    crawl_site('http://phrogz.net/') do |page,uri|
      # page here is a Nokogiri HTML document
      # uri is a URI instance with the address of the page
      puts uri
    end

In short:

  • Keep track of which pages you've seen using a Set. Do this not by the href value, but by the full canonical URI.
  • Use URI.join to turn possibly-relative paths into the correct URI with respect to the current page.
  • Use recursion to keep crawling every link on every page, bailing out if you've already seen a page.
+8

You are missing some things.

A local link may begin with /, but it may also begin with ./, ../, or even no special character at all, meaning the link is relative to the current directory.
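
For example (a quick sketch; the base URL and paths are made up), URI.join resolves each of those forms against the page you're currently on:

    require 'uri'

    # Hypothetical current page; each href form below can still point into the site.
    base = 'http://example.com/articles/page.html'

    %w[/root.html ./local.html ../up.html plain.html].each do |href|
      puts URI.join(base, href)
    end
    # http://example.com/root.html
    # http://example.com/articles/local.html
    # http://example.com/up.html
    # http://example.com/articles/plain.html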

JavaScript can also be used to launch pages, so you'll need to search the entire document, find the tags being used as buttons, and parse the URLs out of them.
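
As a rough illustration only (the onclick patterns below are hypothetical and cover just two common idioms; real sites will need site-specific parsing), you could scan elements with inline handlers and pull URLs out of them:

    require 'nokogiri'

    html = <<~HTML
      <button onclick="location.href='/pricing'">Pricing</button>
      <span onclick="window.open('/docs/start')">Docs</span>
    HTML

    doc = Nokogiri::HTML(html)

    # Pull URLs out of inline handlers. This only matches location.href = '...'
    # and window.open('...'); real pages vary widely.
    js_links = doc.css('[onclick]').map { |node|
      node['onclick'][/(?:location\.href\s*=|window\.open\()\s*['"]([^'"]+)['"]/, 1]
    }.compact

    p js_links  #=> ["/pricing", "/docs/start"]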

This:

    links = nf.xpath '//a' #find all links on current page
    main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq

could be written more simply as:

    links.search('a[href^="/"]').map{ |a| a['href'] }.uniq

In general, do not do this:

 ....map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq 

because it is very clumsy. The conditional inside the map produces nil entries in the resulting array. Instead, use select or reject to reduce the set of links to those meeting your criteria, and then use map to convert them. In your use here, pre-filtering with ^= in the CSS makes it even easier.
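
Side by side (using plain hashes as stand-ins for the Nokogiri node set, just for illustration):

    # Stand-ins for the Nokogiri nodes; hashes respond to [] the same way here.
    links = [{'href' => '/a'}, {'href' => 'http://other.com/'}, {'href' => '/b'}, {'href' => '/a'}]

    # The conditional inside map leaves nils that then have to be compacted away:
    links.map{ |l| l['href'] if l['href'] =~ /^\// }
    #=> ["/a", nil, "/b", "/a"]

    # Filter first, then transform:
    links.select{ |l| l['href'] =~ /^\// }.map{ |l| l['href'] }.uniq
    #=> ["/a", "/b"]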

Don't store the links in memory. You'll lose all your progress if your code crashes or you stop it. Instead, at a minimum, use something like a SQLite database on disk as a data store. Create a href field that is unique so you avoid hitting the same page repeatedly.
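
A minimal sketch of that idea with the sqlite3 gem (the file name, table, and columns here are made up):

    require 'sqlite3'

    db = SQLite3::Database.new('crawler.db')
    db.execute <<~SQL
      CREATE TABLE IF NOT EXISTS links (
        href    TEXT UNIQUE,
        visited INTEGER DEFAULT 0
      )
    SQL

    # The UNIQUE constraint plus INSERT OR IGNORE silently drops URLs already queued.
    db.execute('INSERT OR IGNORE INTO links (href) VALUES (?)', ['http://example.com/about'])

    # Pull the next unvisited link, if any.
    next_href = db.get_first_value('SELECT href FROM links WHERE visited = 0 LIMIT 1')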

Use Ruby's built-in URI class, or the Addressable gem, to parse and manage URLs. They will save you work and will do things right when you start encoding/decoding queries and trying to normalize parameters to check for uniqueness, extracting and manipulating paths, and so on.
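
For instance, a small sketch with Addressable (assuming its normalize behavior covers the variants you care about; check it against your own URLs):

    require 'addressable/uri'

    uri = Addressable::URI.parse('HTTP://Example.COM:80/a/../b/index.html#section')
    uri.fragment = nil        # "#section" doesn't name a different page
    puts uri.normalize
    #=> http://example.com/b/index.html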

Many sites use session IDs in the URL query to identify the visitor. That ID can make every link look different if you start, then stop, then start again, or if you are not returning the cookies received from the site. So return the cookies, and figure out which query parameters are significant and which will throw off your code; keep the former and discard the latter when you store links for later parsing.
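
One possible way to do that last part, sketched with the standard URI class (the parameter names are only examples; you'd identify the real ones by inspecting the site's URLs):

    require 'uri'

    # Hypothetical list of per-visitor noise parameters; find the real ones by
    # comparing URLs across sessions on the site you're crawling.
    NOISE_PARAMS = %w[PHPSESSID sid jsessionid]

    def canonical(url)
      uri = URI.parse(url)
      return url unless uri.query
      kept = URI.decode_www_form(uri.query).reject { |k, _| NOISE_PARAMS.include?(k) }
      uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
      uri.to_s
    end

    puts canonical('http://example.com/page?id=7&PHPSESSID=abc123')
    #=> http://example.com/page?id=7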

Use an HTTP client such as Typhoeus with Hydra to fetch multiple pages in parallel and store them in your database, with a separate process that parses them and feeds the URLs still to be parsed back into the database. This can make a big difference in overall processing time.
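
A bare-bones sketch of the Typhoeus/Hydra side (the URLs are placeholders, and a real crawler would write the bodies to the database instead of printing them):

    require 'typhoeus'

    urls = %w[http://example.com/ http://example.com/about http://example.com/contact]

    hydra = Typhoeus::Hydra.new(max_concurrency: 10)
    urls.each do |url|
      request = Typhoeus::Request.new(url, followlocation: true)
      request.on_complete do |response|
        # A real crawler would store response.body in the database here.
        puts "#{url} -> #{response.code} (#{response.body.bytesize} bytes)"
      end
      hydra.queue(request)
    end
    hydra.run   # blocks until every queued request has finished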

Honor the site's robots.txt file and throttle your requests to avoid beating up their server. Nobody likes bandwidth hogs, and consuming a significant amount of a site's bandwidth or CPU time without permission is a good way to get noticed and then banned. At that point your crawler's throughput drops to zero.
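
A deliberately naive sketch of both ideas: it only reads Disallow lines under "User-agent: *" and uses a fixed delay, so treat it as a starting point rather than a compliant robots.txt implementation:

    require 'open-uri'
    require 'uri'

    # Ignores Allow rules, wildcards, and Crawl-delay on purpose.
    def disallowed_paths(site)
      robots = open(URI.join(site, '/robots.txt')).read rescue ''
      paths, applies = [], false
      robots.each_line do |line|
        line = line.sub(/#.*/, '').strip
        applies = (line.split(':', 2).last.strip == '*') if line =~ /\AUser-agent:/i
        paths << line.split(':', 2).last.strip if applies && line =~ /\ADisallow:/i
      end
      paths.reject(&:empty?)
    end

    blocked = disallowed_paths('http://example.com/')
    url     = 'http://example.com/private/page'
    allowed = blocked.none? { |prefix| URI.parse(url).path.start_with?(prefix) }

    sleep 1 if allowed   # crude politeness delay before the next request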

+3

This is a more complicated problem than you seem to realize. Using a library along with Nokogiri is probably the way to go. Unless you're running Windows (like me), you might want to look into Anemone.

+1
