You are missing some things.
A local link can start with /, but it can also start with ., .., or even no special character at all, meaning the link points into the current directory.
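As a rough sketch, assuming you know the URL of the page the link came from, the built-in URI class resolves all of those forms against it the same way a browser would:

require 'uri'

# Base is the page the link was found on.
base = 'http://example.com/dir/page.html'

['/root.html', './sibling.html', '../up.html', 'same_dir.html'].each do |href|
  puts URI.join(base, href)
end
# http://example.com/root.html
# http://example.com/dir/sibling.html
# http://example.com/up.html
# http://example.com/dir/same_dir.html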
JavaScript can also be used as a link, so you'll need to search throughout the document, find the tags being used as buttons, and parse the URL out of them.
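Something along these lines could pull candidate URLs out of onclick handlers; the regex is only a guess at what the handlers look like, so inspect the markup of the site you're spidering and adjust it (nf is the Nokogiri document from your code):

# Elements acting as buttons often carry the target in an onclick handler,
# e.g. onclick="window.location='/some/page'".
js_links = nf.css('[onclick]').map { |el|
  el['onclick'][/['"](\/[^'"]+|https?:\/\/[^'"]+)['"]/, 1]
}.compact.uniq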
This:
links = nf.xpath '//a' # find all links on the current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
can be better written as:
links.search('a[href^="/"]').map{ |a| a['href'] }.uniq
In general, do not do this:
....map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
because it's awkward. The conditional inside the map leaves nil entries in the resulting array, which is why the compact is needed afterwards. Instead, use select or reject to narrow the links down to those matching your criteria, then use map to transform them. In your use here, pre-filtering with ^= in the CSS makes it even easier; for cases the selector can't express, see the sketch below.
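The select-then-map version of the same thing, as a sketch:

links = nf.css('a')

# select keeps only the links matching the criteria, then map extracts the hrefs;
# no nils are produced, so no compact is needed.
main_links = links.select { |a| a['href'] =~ %r{\A/} }
                  .map    { |a| a['href'] }
                  .uniq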
Don't store the links in memory. You'll lose all progress if your code crashes or you stop it. Instead, at a minimum, use something like a SQLite database on disk as a data store. Create a "href" field that is unique to avoid hitting the same page repeatedly.
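A minimal sketch using the sqlite3 gem; the table and column names are just placeholders:

require 'sqlite3'

db = SQLite3::Database.new('spider.db')

# UNIQUE on href means a URL can only be queued once.
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS pages (
    id   INTEGER PRIMARY KEY,
    href TEXT UNIQUE NOT NULL,
    body TEXT
  )
SQL

# INSERT OR IGNORE silently skips hrefs that are already stored.
db.execute('INSERT OR IGNORE INTO pages (href) VALUES (?)', ['http://example.com/'])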
Use Ruby's built-in URI class, or the Addressable gem, to parse and manipulate URLs. They save you work and will do things the right way when you start encoding/decoding queries, normalizing parameters to check for uniqueness, extracting and manipulating paths, etc.
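A quick illustration with the built-in URI class:

require 'uri'

uri = URI.parse('http://example.com/dir/page.html?b=2&a=1')

uri.host                        # => "example.com"
uri.path                        # => "/dir/page.html"
URI.decode_www_form(uri.query)  # => [["b", "2"], ["a", "1"]]

# Sorting the parameters is one simple way to normalize for uniqueness checks.
uri.query = URI.encode_www_form(URI.decode_www_form(uri.query).sort)
uri.to_s                        # => "http://example.com/dir/page.html?a=1&b=2"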
Many sites use session IDs in the URL query to identify the visitor. That ID can make every link look different if you start, then stop, then start again, or if you're not returning the cookies the site sends you. So return the cookies, and figure out which query parameters are significant and which will throw off your code. Keep the first and discard the second when you store links for later parsing.
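Something like this can strip the noise out before a link is stored; the parameter names in the list are only examples, you'll have to work out the right ones for the sites you spider:

require 'uri'

# Example session-ID parameter names; adjust for the sites you crawl.
SESSION_PARAMS = %w[phpsessid sid sessionid jsessionid].freeze

def canonical_href(raw)
  uri = URI.parse(raw)
  return raw unless uri.query

  kept = URI.decode_www_form(uri.query)
            .reject { |name, _| SESSION_PARAMS.include?(name.downcase) }

  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end

canonical_href('http://example.com/page?id=42&PHPSESSID=abc123')
# => "http://example.com/page?id=42"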
Use an HTTP client such as Typhoeus with Hydra to fetch multiple pages in parallel and store them in your database, with a separate process that parses them and feeds the URLs to crawl back into the database. This can make a big difference in your overall processing time.
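A rough sketch of the fetch side, reusing the pages table from the SQLite sketch above; a separate process would pull the stored bodies, extract links, and insert them back:

require 'typhoeus'
require 'sqlite3'

db    = SQLite3::Database.new('spider.db')
hydra = Typhoeus::Hydra.new(max_concurrency: 10)

# Queue every stored URL that hasn't been fetched yet.
db.execute('SELECT href FROM pages WHERE body IS NULL').flatten.each do |href|
  request = Typhoeus::Request.new(href, followlocation: true)
  request.on_complete do |response|
    if response.success?
      db.execute('UPDATE pages SET body = ? WHERE href = ?', [response.body, href])
    end
  end
  hydra.queue(request)
end

hydra.run  # runs all queued requests concurrently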
Honor the site's robots.txt file, and throttle your requests to avoid beating up their server. Nobody likes bandwidth hogs, and consuming a significant amount of a site's bandwidth or CPU time without permission is a good way to get noticed and then banned. At that point your spider's throughput on that site will be zero.
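A crude sketch of both ideas; a real spider should use a proper robots.txt parser (a gem such as robots, for instance) and honor Crawl-delay, but this shows the shape of it (urls and fetch_and_store are placeholders):

require 'net/http'
require 'uri'

# Very naive robots.txt handling: collect every Disallow path and ignore
# per-user-agent sections. Enough to illustrate the idea, not production-ready.
def disallowed_paths(host)
  body = Net::HTTP.get(URI("http://#{host}/robots.txt"))
  body.scan(/^Disallow:\s*(\S+)/i).flatten
rescue StandardError
  []
end

blocked = disallowed_paths('example.com')

urls.each do |url|
  next if blocked.any? { |path| URI(url).path.start_with?(path) }
  # fetch_and_store(url) ...
  sleep 1  # simple politeness delay between requests
end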