How can I get an email address from HTML using Nokogiri?

Question


How can I get an email address from HTML using Nokogiri? I am thinking of using regular expressions, but I don't know if that is the best solution.

Code example

    <html>
      <title>Example</title>
      <body>
        This is an example text.
        <a href="mailto:example@example.com">Mail to me</a>
      </body>
    </html>

My question is: is there a method in Nokogiri to get the email address when it is not between tags, but inside an attribute?

thanks

+4

ruby nokogiri

jgiunta Feb 29 '12 at 1:12


3 answers

Try to get the whole html page and use regular expressions.
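As a rough sketch of what this regular-expression approach could look like, the snippet below scans a page held in a string. The sample HTML and the pattern are assumptions made for illustration; the pattern is deliberately loose and is not a complete email matcher.

```ruby
# Sample page; in practice this string would come from a file or an HTTP response.
html = <<~HTML
  <html> <title>Example</title> <body>
  This is an example text.
  <a href="mailto:example@example.com">Mail to me</a>
  <a href="http://example.com">A Web link</a>
  </body> </html>
HTML

# Loose pattern (an assumption, not a full address validator):
# capture everything after "mailto:" up to a quote, "?", ">" or whitespace.
mailto_pattern = /mailto:\s*([^"'?\s>]+)/i

# String#scan returns one array of capture groups per match.
addresses = html.scan(mailto_pattern).flatten.uniq

puts addresses
```

A regex like this ignores the document structure, so it would also pick up "mailto:" text that appears outside of links.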

0

freeze Feb 29 '12 at 16:59


I will preface this by saying that I know nothing about Nokogiri. But I just went to their site and looked at the documentation, and it looks pretty cool.

If you add an email_field class (or whatever you want to call it) to the element that holds your email address, you can adapt the sample code to do what you are looking for.

    require 'nokogiri'
    require 'open-uri'

    # Get a Nokogiri::HTML::Document for the page we're interested in...
    # (URI.open is required on Ruby 3.0+, where plain open no longer fetches URLs)
    doc = Nokogiri::HTML(URI.open('http://www.yoursite.com/your_page.html'))

    # Do funky things with it using Nokogiri::XML::Node methods...

    ####
    # Search for nodes by css
    doc.css('.email_field').each do |email|
      # assuming you have more than one, do something with all your email fields here
    end

If I were you, I would just look at their documentation and experiment with some of their examples.

Here is the website: http://nokogiri.org/

0

Phillipkregg Feb 29 '12 at 20:11


matt · Accepted Answer · 2012-02-29T21:21:41+0000

You can extract email addresses using xpath.

The //a selector selects every a tag on the page, and you can specify the href attribute using the @ syntax, so //a/@href gives you the href attributes of all a tags on the page.

If the page contains a tags with other kinds of URLs (for example, http:// URLs), you can use XPath functions to narrow down the selected nodes further. The selector

    //a[starts-with(@href, "mailto:")]/@href

gives you the href attribute nodes of all a tags whose href attribute starts with "mailto:".

Combining all this and adding a little extra code to cut out "mailto:" from the beginning of the attribute value:

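    require 'nokogiri'

    # Select the href attribute of every a tag whose href starts with "mailto:".
    selector = "//a[starts-with(@href, \"mailto:\")]/@href"

    doc = Nokogiri::HTML.parse(File.read('my_file.html'))

    nodes = doc.xpath(selector)

    # Drop the 7-character "mailto:" prefix; strip removes any stray whitespace.
    addresses = nodes.collect { |n| n.value[7..-1].strip }

    puts addresses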

With a test file that looks like this:

    <html>
      <title>Example</title>
      <body>
        This is an example text.
        <a href="mailto:example@example.com">Mail to me</a>
        <a href="http://example.com">A Web link</a>
        <a>An empty anchor.</a>
      </body>
    </html>

this code prints the desired example@example.com. The addresses variable is an array of all the email addresses found in the mailto links in the document.
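The fixed-width slice above assumes every value begins with exactly "mailto:" and carries nothing after the address. As a hedged sketch, a small helper (the name address_from_mailto is hypothetical, not part of Nokogiri) could also cope with upper-case schemes, surrounding whitespace, and a trailing "?subject=..." part. It works on plain strings, so it could be applied to each n.value from the XPath result.

```ruby
# Hypothetical helper: turn a raw href value into a bare address, or nil
# when the value is not a mailto link or contains no address.
def address_from_mailto(href)
  value = href.to_s.strip
  return nil unless value.match?(/\Amailto:/i)

  # Drop the scheme, then anything from the first "?" onwards.
  address = value.sub(/\Amailto:/i, '').split('?', 2).first.to_s.strip
  address.empty? ? nil : address
end

# Illustrative href values (assumed inputs, not taken from a real page).
hrefs = [
  'mailto:example@example.com',
  ' MAILTO: example@example.com ',
  'mailto:example@example.com?subject=Hello',
  'http://example.com'
]

addresses = hrefs.map { |h| address_from_mailto(h) }.compact.uniq
puts addresses
```

Returning nil for non-mailto values lets compact discard them, so the helper can be mapped over every href on a page without filtering first.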
