Parse html table with Nokogiri and Mechanize

Question

Using the following code, I am trying to scrape the call log from our phone provider's web application so that I can enter the information into my Ruby on Rails application.

    desc "Import incoming calls"
    task :fetch_incomingcalls => :environment do
      # Logs into manage.phoneprovider.co.uk and retrieves the list of incoming calls.
      require 'rubygems'
      require 'mechanize'
      require 'logger'

      # Create a new mechanize object
      agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }

      # Load the Phone Provider website
      page = agent.get("https://manage.phoneprovider.co.uk/login")

      # Select the first form
      form = agent.page.forms.first
      form.username = 'username'
      form.password = 'password'

      # Submit the form
      page = form.submit form.buttons.first

      # Click on link called Call Logs
      page = agent.page.link_with(:text => "Call Logs").click

      # Click on link called Incoming Calls
      page = agent.page.link_with(:text => "Incoming Calls").click

      # Prints out table rows
      # puts doc.css('table > tr')

      # Print out the body as a test
      # puts page.body
    end

As you can see from the last five lines, I have tested that "puts page.body" works, and the code above runs as expected. It logs in successfully, then navigates to Call Logs, followed by Incoming Calls. The incoming call table looks like this:

    | Timestamp    | Source   | Destination | Duration |
    | 03 Jan 13:40 | 12345678 | 12345679    | 00:01:01 |
    | 03 Jan 13:40 | 12345678 | 12345679    | 00:01:01 |
    | 03 Jan 13:40 | 12345678 | 12345679    | 00:01:01 |
    | 03 Jan 13:40 | 12345678 | 12345679    | 00:01:01 |

It is generated from the following HTML:

    <thead>
      <tr>
        <td>Timestamp</td>
        <td>Source</td>
        <td>Destination</td>
        <td>Duration</td>
        <td>Cost</td>
        <td class='centre'>Recording</td>
      </tr>
    </thead>
    <tbody>
      <tr class='o'>
        <tr>
          <td>03 Jan 13:40</td>
          <td>12345678</td>
          <td>12345679</td>
          <td>00:01:14</td>
          <td></td>
          <td class='opt recording'> </td>
        </tr>
      </tr>
      <tr class='e'>
        <tr>
          <td>30 Dec 20:31</td>
          <td>12345678</td>
          <td>12345679</td>
          <td>00:02:52</td>
          <td></td>
          <td class='opt recording'> </td>
        </tr>
      </tr>
      <tr class='o'>
        <tr>
          <td>24 Dec 00:03</td>
          <td>12345678</td>
          <td>12345679</td>
          <td>00:00:09</td>
          <td></td>
          <td class='opt recording'> </td>
        </tr>
      </tr>
      <tr class='e'>
        <tr>
          <td>23 Dec 14:56</td>
          <td>12345678</td>
          <td>12345679</td>
          <td>00:00:07</td>
          <td></td>
          <td class='opt recording'> </td>
        </tr>
      </tr>
      <tr class='o'>
        <tr>
          <td>21 Dec 13:26</td>
          <td>07793770851</td>
          <td>12345679</td>
          <td>00:00:26</td>
          <td></td>
          <td class='opt recording'> </td>
        </tr>
      </tr>

I am trying to work out how to select only the cells I want (Timestamp, Source, Destination and Duration) and print them. After that I can worry about writing them to the database rather than to the terminal.

I tried using SelectorGadget, but it just shows "td", or "tr:nth-child(6) td, tr:nth-child(2) td" if I select several cells.

Any help or pointers would be appreciated!

+4

html html-table ruby-on-rails nokogiri mechanize

dannymcc Jan 05 '12 at 20:12

3 answers

Your answer lies in this Railscast:

http://railscasts.com/episodes/190-screen-scraping-with-nokogiri

This related question may also help: "How to parse an HTML table using Nokogiri?"

+2

dbKooper Jan 9 '12 at 13:06

You can reach the exact node you need from the root (in the worst case) using XPath selectors. How to use XPath with Nokogiri is described here.

For more information on how to address all of your elements with XPath, see here.

-1

jake Jan 6 '12 at 6:50

Ezekiel templin · Accepted Answer · 2012-01-06T20:03:52+0000

There is a pattern in the table that is easy to exploit with XPath. The <tr> rows with the required information do not have a class attribute. Fortunately, XPath provides some simple logical operations, including not(). This gives us exactly the functionality we need.

Once we have narrowed down the rows we are dealing with, we can iterate over them and extract the text of the desired columns using the XPath element[n] selector. It is important to note that XPath counts items starting at 1, so the first column of a table row is td[1].

Sample code using Nokogiri (with RSpec specs):

    require "rspec"
    require "nokogiri"

    HTML = <<HTML
    <table>
      <thead>
        <tr>
          <td> Timestamp </td>
          <td> Source </td>
          <td> Destination </td>
          <td> Duration </td>
          <td> Cost </td>
          <td class='centre'> Recording </td>
        </tr>
      </thead>
      <tbody>
        <tr class='o'>
          <td></td>
        </tr>
        <tr>
          <td> 03 Jan 13:40 </td>
          <td> 12345678 </td>
          <td> 12345679 </td>
          <td> 00:01:14 </td>
          <td></td>
          <td class='opt recording'></td>
        </tr>
        <tr class='e'>
          <td></td>
        </tr>
        <tr>
          <td> 30 Dec 20:31 </td>
          <td> 12345678 </td>
          <td> 12345679 </td>
          <td> 00:02:52 </td>
          <td></td>
          <td class='opt recording'></td>
        </tr>
        <tr class='o'>
          <td></td>
        </tr>
        <tr>
          <td> 24 Dec 00:03 </td>
          <td> 12345678 </td>
          <td> 12345679 </td>
          <td> 00:00:09 </td>
          <td></td>
          <td class='opt recording'></td>
        </tr>
        <tr class='e'>
          <td></td>
        </tr>
        <tr>
          <td> 23 Dec 14:56 </td>
          <td> 12345678 </td>
          <td> 12345679 </td>
          <td> 00:00:07 </td>
          <td></td>
          <td class='opt recording'></td>
        </tr>
        <tr class='o'>
          <td></td>
        </tr>
        <tr>
          <td> 21 Dec 13:26 </td>
          <td> 07793770851 </td>
          <td> 12345679 </td>
          <td> 00:00:26 </td>
          <td></td>
          <td class='opt recording'></td>
        </tr>
      </tbody>
    </table>
    HTML

    class TableExtractor
      def extract_data html
        Nokogiri::HTML(html).xpath("//table/tbody/tr[not(@class)]").collect do |row|
          timestamp   = row.at("td[1]").text.strip
          source      = row.at("td[2]").text.strip
          destination = row.at("td[3]").text.strip
          duration    = row.at("td[4]").text.strip
          {:timestamp => timestamp, :source => source, :destination => destination, :duration => duration}
        end
      end
    end

    describe TableExtractor do
      before(:all) do
        @html = HTML
      end

      it "should extract the timestamp properly" do
        subject.extract_data(@html)[0][:timestamp].should eq "03 Jan 13:40"
      end

      it "should extract the source properly" do
        subject.extract_data(@html)[0][:source].should eq "12345678"
      end

      it "should extract the destination properly" do
        subject.extract_data(@html)[0][:destination].should eq "12345679"
      end

      it "should extract the duration properly" do
        subject.extract_data(@html)[0][:duration].should eq "00:01:14"
      end

      it "should extract all informational rows" do
        subject.extract_data(@html).count.should eq 5
      end
    end
