You can extract email addresses using XPath.
The //a selector selects every a tag on the page, and you can pick out the href attribute with the @ syntax, so //a/@href gives you the href of every a tag on the page.
If there are several a tags on the page with different types of URLs (for example, http:// URLs), you can use XPath functions to narrow down the selected nodes. The selector //a[starts-with(@href, "mailto:")]/@href
will give you the href nodes of all a tags whose href attribute starts with "mailto:".
Combining all this and adding a little extra code to cut out "mailto:" from the beginning of the attribute value:
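    require 'nokogiri'

    # Select the href attribute of every <a> whose href starts with "mailto:".
    selector = "//a[starts-with(@href, \"mailto:\")]/@href"

    doc = Nokogiri::HTML.parse File.read 'my_file.html'
    nodes = doc.xpath selector

    # Drop the 7-character "mailto:" prefix and any surrounding whitespace.
    addresses = nodes.collect {|n| n.value[7..-1].strip}

    puts addresses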
With a test file that looks like this:
    <html>
    <title>Example</title>
    <body>
    This is an example text.
    <a href="mailto:example@example.com">Mail to me</a>
    <a href="http://example.com">A Web link</a>
    <a>An empty anchor.</a>
    </body>
    </html>
this code prints the desired example@example.com. Here addresses is an array of all the email addresses found in the mailto links in the document.