Parsing a Table Using Nokogiri

I would like to parse a table using Nokogiri. This is what I have so far:

def parse_table_nokogiri(html)

    doc = Nokogiri::HTML(html)

    doc.search('table > tr').each do |row|
        row.search('td/font/text()').each do |col|
            p col.to_s
        end
    end

end

Some rows of the table look like this:

<tr>
  <td>
     Some text
  </td>
</tr>

... and some look like this:

<tr>
  <td>
     <font> Some text </font>
  </td>
</tr>

My XPath expression works for the second case but not the first. Is there an XPath expression that would give me the text from the innermost node of the cell, so that I can handle both cases?


I incorporated the suggested changes into my snippet:

require 'cgi'
require 'nokogiri'

def parse_table_nokogiri(html)
    doc = Nokogiri::HTML(html)
    # Pick the table that contains the most rows.
    table = doc.xpath('//table').max_by { |t| t.xpath('.//tr').length }

    # Skip the first row.
    rows = table.search('tr')[1..-1]
    rows.each do |row|
        # Collect every text node under each td, stripped and unescaped.
        cells = row.search('td//text()').collect { |text| CGI.unescapeHTML(text.to_s.strip) }
        cells.each do |col|
            puts col
            puts "_____________"
        end
    end
end
3 answers

Use:

td//text()[normalize-space()]

This selects all non-whitespace-only text-node descendants of any td child of the current node (the tr already selected in your code).

Or, if you want to select all text-node descendants, regardless of whether they are whitespace-only or not:

td//text()

UPDATE

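The asker reported in a comment that an unwanted td whose content is just a '&#160;' (a non-breaking space) is also being selected.

To also exclude td elements whose content consists only of (one or more) nbsp characters, use: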

td//text()[translate(normalize-space(), '&#160;', '')]

A simple approach:

doc.search('//td').each do |cell|
  puts cell.content
end

A simple (but not DRY) way, using alternation:

require 'nokogiri'

doc = Nokogiri::HTML <<ENDHTML
<body><table><thead><tr><td>NOT THIS</td></tr></thead><tr>
  <td>foo</td>
  <td><font>bar</font></td>
</tr></table></body>
ENDHTML

p doc.xpath( '//table/tr/td/text()|//table/tr/td/font/text()' )
#=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
#=>  #<Nokogiri::XML::Text:0x804286fc "bar">]

A more DRY XPath expression is possible.

In this case, however, you can simply use:

p doc.xpath( '//table/tr/td//text()' )
#=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
#=>  #<Nokogiri::XML::Text:0x804286fc "bar">]

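Note that your table structure (and mine above) does not include an explicit tbody, as is possible in XHTML. Since you explicitly wrote table > tr, I assume that is intentional; rows wrapped in a tbody or thead would not be matched.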


Source: https://habr.com/ru/post/1788978/

