Failed to retrieve html table rows

Question

[Image: screenshot of the table]

I am trying to extract all five rows listed in the table above.

I am using the Hpricot Ruby library to retrieve the table rows with an XPath expression.

In my example, the XPath expression I use is /html/body/center/table/tr. Note that I removed the tbody tag from the expression, which is usually needed for retrieval to succeed.

The strange thing is that I get only the first three rows; the last two are missing. I have no idea what is going on.

EDIT: Nothing magical about the code; I am just attaching it on request.

    require 'open-uri'
    require 'hpricot'

    faculty = Hpricot(open("http://www.utm.utoronto.ca/7800.0.html"))
    (faculty/"/html/body/center/table/tr").each do |text|
      puts text.to_s
    end

html ruby xpath web-scraping hpricot

Terry li Nov 20 '11 at 21:11

2 answers

The table/tr path does not exist. It is table/tbody/tr or table//tr. When you use table/tr, you are specifically looking for a <tr> that is a direct child of <table>, but judging from your image, that is not how the markup is structured.
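The difference between the child step (/) and the descendant step (//) can be sketched as follows. This is only an illustration under assumptions: it uses REXML, which ships with Ruby, instead of Hpricot, and the markup is a small hand-written, well-formed stand-in, not the actual page.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for the table (not the real page).
markup = <<~XML
  <table>
    <tbody>
      <tr><td>row 1</td></tr>
      <tr><td>row 2</td></tr>
    </tbody>
  </table>
XML

doc = REXML::Document.new(markup)

# "/" matches direct children only; "//" matches descendants at any depth.
direct     = REXML::XPath.match(doc, '/table/tr')
via_tbody  = REXML::XPath.match(doc, '/table/tbody/tr')
descendant = REXML::XPath.match(doc, '/table//tr')

puts "table/tr:       #{direct.size}"      # 0 rows: no <tr> is a direct child
puts "table/tbody/tr: #{via_tbody.size}"   # 2 rows
puts "table//tr:      #{descendant.size}"  # 2 rows
```

Because table//tr matches rows whether or not a tbody sits in between, it is the more forgiving choice when you are not sure how the parser built the tree.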

d11wtq Nov 20 '11 at 10:10

qerub · Accepted Answer · 2011-11-23T21:18:06+0000

The HTML document is invalid. (See http://validator.w3.org/check?uri=http%3A%2F%2Fwww.utm.utoronto.ca%2F7800.0.html .) Hpricot parses it differently than your browser does, hence the different results, but it cannot really be blamed: prior to HTML5, there was no standard way to parse invalid HTML documents.

I tried replacing Hpricot with Nokogiri, and it seems to give the expected parse. Code:

    require 'open-uri'
    require 'nokogiri'

    # On Ruby 3.0 and later, call URI.open instead of open for URLs.
    faculty = Nokogiri.HTML(open("http://www.utm.utoronto.ca/7800.0.html"))
    faculty.search("/html/body/center/table/tr").each do |text|
      puts text
    end

Maybe you should switch?
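Whichever parser you use, it can help to check where the rows actually ended up in the parsed tree. The sketch below is a rough illustration, not a diagnosis of the real page: the markup is a hypothetical tree that a lenient parser might build from invalid HTML, ancestor_path is a helper name made up for this example, and REXML stands in for the HTML parser.

```ruby
require 'rexml/document'

# Hypothetical tree (an assumption, not the real page): three rows sit
# directly under <table>, two more ended up inside a <tbody>.
recovered = <<~XML
  <table>
    <tr><td>1</td></tr>
    <tr><td>2</td></tr>
    <tr><td>3</td></tr>
    <tbody>
      <tr><td>4</td></tr>
      <tr><td>5</td></tr>
    </tbody>
  </table>
XML

doc = REXML::Document.new(recovered)

# Illustrative helper: build "/table/tbody/tr" style paths by walking up
# the parents until the document node is reached.
def ancestor_path(element)
  names = []
  node = element
  while node && !node.is_a?(REXML::Document)
    names.unshift(node.name)
    node = node.parent
  end
  '/' + names.join('/')
end

# Count how many rows live under each distinct path.
counts = Hash.new(0)
REXML::XPath.match(doc, '//tr').each { |tr| counts[ancestor_path(tr)] += 1 }

counts.each { |path, n| puts "#{path}: #{n}" }
```

With a tree shaped like this, a query for table/tr would return only the three direct children, which is one way a partial result such as three rows out of five could arise.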
