Failed to retrieve HTML table rows


I am trying to extract all five rows listed in the table above.

I am using the Hpricot Ruby library to retrieve the table rows with an XPath expression.

In my example, the XPath expression I use is /html/body/center/table/tr. Note that I removed the tbody step from the expression, which is usually required for the retrieval to succeed.

The strange thing is that I get only the first three rows in the result; the last two are missing. I have no idea what is going on there.

EDIT: There is nothing magical about the code; I am just attaching it on request.

    require 'open-uri'
    require 'hpricot'

    faculty = Hpricot(open("http://www.utm.utoronto.ca/7800.0.html"))
    (faculty/"/html/body/center/table/tr").each do |text|
      puts text.to_s
    end
2 answers

The HTML document is invalid. (See http://validator.w3.org/check?uri=http%3A%2F%2Fwww.utm.utoronto.ca%2F7800.0.html .) Hpricot parses it differently than your browser does, hence the different results, but it cannot really be blamed for that. Prior to HTML5, there was no standard algorithm for parsing invalid HTML documents.
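To illustrate what "invalid" means in practice, here is a minimal sketch that feeds a small fragment to a strict parser. It uses REXML from Ruby's standard distribution purely as a stand-in (it is an XML parser, not Hpricot or a browser), and the BROKEN fragment and the well_formed? helper are hypothetical, not taken from the page in question. A strict parser simply rejects the mismatched tags, whereas a lenient HTML parser has to guess at a repair, and different parsers may guess differently.

```ruby
require 'rexml/document'

# Hypothetical fragment: the <td> is never closed before </tr>.
BROKEN = '<table><tr><td>row 1</tr></table>'
FIXED  = '<table><tr><td>row 1</td></tr></table>'

# Returns true if the strict parser accepts the markup, false otherwise.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts well_formed?(BROKEN)
puts well_formed?(FIXED)
```

The W3C validator linked above remains the right tool for checking the real page; this only shows the kind of error that forces a parser to improvise.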

I tried replacing Hpricot with Nokogiri, and it seemed to parse the page as expected. Code:

    require 'open-uri'
    require 'nokogiri'

    faculty = Nokogiri.HTML(open("http://www.utm.utoronto.ca/7800.0.html"))
    faculty.search("/html/body/center/table/tr").each do |text|
      puts text
    end

Maybe you should switch?


The table/tr path does not exist. It should be table/tbody/tr or table//tr. With table/tr you are specifically looking for a <tr> that is a direct child of <table>, but judging from your image, that is not how the markup is structured.
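The difference between the three paths can be sketched on a small, well-formed fragment. This example assumes markup in which the rows sit inside an explicit <tbody>; the MARKUP string is hypothetical and is not the real page, and REXML (shipped with Ruby) is used here only as a convenient XPath engine, not as a replacement for Hpricot or Nokogiri.

```ruby
require 'rexml/document'

# Hypothetical, well-formed markup standing in for the real page.
MARKUP = <<~HTML
  <html><body><center>
    <table>
      <tbody>
        <tr><td>row 1</td></tr>
        <tr><td>row 2</td></tr>
      </tbody>
    </table>
  </center></body></html>
HTML

doc = REXML::Document.new(MARKUP)

# Child step: only <tr> elements that are direct children of <table>.
direct = REXML::XPath.match(doc, '/html/body/center/table/tr')

# Explicit <tbody> step between <table> and <tr>.
via_tbody = REXML::XPath.match(doc, '/html/body/center/table/tbody/tr')

# Descendant step: <tr> elements at any depth below <table>.
any_depth = REXML::XPath.match(doc, '/html/body/center/table//tr')

puts "table/tr:       #{direct.size}"
puts "table/tbody/tr: #{via_tbody.size}"
puts "table//tr:      #{any_depth.size}"
```

Whether a <tbody> is actually present depends on the source markup and on the parser: browsers insert one when building the DOM, while a library parser may not, so table//tr is the form that works in either case.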

