You have two problems here: (1) extracting the data you want from each page, and (2) crawling every page of the site.
For (2), I would look at something like Anemone, which will make it much easier to crawl a full site:
Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL to perform actions on every page of a site, skip certain URLs, and calculate the shortest path to a given page on a site.

The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.
For simple cases, Anemone will even give you an array of all the links on each page, so you don't necessarily need Nokogiri at all. For anything more involved, you might want to combine Anemone with something like Mechanize and Nokogiri; it depends on your requirements.
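As a rough illustration, here is a minimal sketch of what a crawl with Anemone's DSL can look like; the URL, the skip pattern, and the CSS selector are placeholders for whatever your site actually needs:

```ruby
require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Skip URLs you don't care about (placeholder pattern)
  anemone.skip_links_like %r{/login|/logout}

  anemone.on_every_page do |page|
    puts page.url

    # page.links is an array of every link found on the page,
    # so a simple crawl may not need Nokogiri at all
    page.links.each { |link| puts "  -> #{link}" }

    # page.doc is the parsed Nokogiri document, for anything fancier
    title = page.doc.at_css('title')
    puts "  title: #{title.text}" if title
  end
end
```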