
Nokogiri: Parsing an irregular "<"

I am trying to use Nokogiri to parse the following segment:

<tr>
 <th>Total Weight</th>
 <td>< 1 g</td>
 <td style="text-align: right">0 %</td>

</tr>             
<tr><td class="skinny_black_bar" colspan="3"></td></tr>

However, I think the "<" sign in "< 1 g" is causing problems for Nokogiri. Does anyone know of a workaround? Is there a way to escape the "<" sign? Or is there a function I can call to just get the plain HTML segment?

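2 answers

A bare "less than" sign (<) is not legal HTML, but browsers contain a lot of code for working out what the HTML was meant to say instead of just showing an error. That is why your invalid HTML sample renders the way you expect in browsers.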

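So the trick is to get Nokogiri to do the same work to compensate for the bad HTML. First, make sure you parse the file as HTML rather than XML: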

require 'nokogiri'

# The block form closes the file once parsing is done.
doc = File.open("table.html") { |f| Nokogiri::HTML(f) }

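This parses the file without complaint, but it throws away the < 1 g text. Look at how the content of the first two TD elements is parsed: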

doc.xpath('(//td)[1]/text()').to_s
=> "\n "

doc.xpath('(//td)[2]/text()').to_s
=> "0 %"

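Nokogiri discarded the invalid text but kept parsing the surrounding structure. You can even see the error message from Nokogiri: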

doc.errors
=> [#<Nokogiri::XML::SyntaxError: htmlParseStartTag: invalid element name>]
doc.errors[0].line
=> 3

Yup, line 3 is bad.
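To find such lines before handing a document to the parser, a rough check using only the Ruby standard library could look like the sketch below. It is not part of Nokogiri: `stray_lt_lines` is a hypothetical helper, and it assumes a "<" starts markup only when it is followed by a letter, "/", "!" or "?". It knows nothing about scripts or comments, so it can report false positives there.

```ruby
# Hypothetical helper, not a Nokogiri API: returns the 1-based numbers of
# lines containing a "<" that cannot begin a tag, comment, or doctype.
def stray_lt_lines(html)
  html.each_line.with_index(1)
      .select { |line, _number| line.match?(/<(?![A-Za-z\/!?])/) }
      .map { |_line, number| number }
end

# The fragment from the question, as a heredoc.
sample = <<~HTML
  <tr>
   <th>Total Weight</th>
   <td>< 1 g</td>
   <td style="text-align: right">0 %</td>
  </tr>
HTML

p stray_lt_lines(sample)
```

For the fragment from the question this flags line 3, the same line the Nokogiri error points at.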

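So it seems Nokogiri does not have the same level of support for parsing bad HTML as browsers do. I recommend pre-processing your files with another library. I ran TagSoup on your sample file, and it fixed the stray < by changing it to &lt;, like so: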

% java -jar tagsoup-1.1.3.jar foo.html | xmllint --format -
src: foo.html
<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tbody>
        <tr>
          <th colspan="1" rowspan="1">Total Weight</th>
          <td colspan="1" rowspan="1">&lt;1 g</td>
          <td colspan="1" rowspan="1" style="text-align: right">0 %</td>
        </tr>
        <tr>
          <td colspan="3" rowspan="1" class="skinny_black_bar"/>
        </tr>
      </tbody>
    </table>
  </body>
</html>

OK, I got it working with a regular expression:

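def fix_irregular_html(html)
  # Match a "<" followed by text without angle brackets and then another "<"
  # or the end of a line: such a "<" never closes, so it is treated as text.
  regexp = /<([^<>]*)(<|$)/

  # Each match consumes the "<" that ends it, so consecutive stray "<" signs
  # need several passes; repeat until gsub no longer changes the string.
  while (fixed_html = html.gsub(regexp, "&lt;\\1\\2")) && fixed_html != html
    html = fixed_html
  end

  fixed_html
end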
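
The loop is needed because each match consumes the "<" that ends it. A possible single-pass variant, sketched here on the assumption that the same matching rule is wanted, uses a lookahead so that the terminating "<" stays available for the next match. The name `escape_stray_lt` is made up for this illustration. Like the original, it would also escape a tag that is split across lines, because `$` matches at every line end.

```ruby
# Hypothetical single-pass variant of the function above.
# (?=<|$) only checks that another "<" or a line end follows; it does not
# consume it, so adjacent stray "<" signs are all handled in one gsub call.
def escape_stray_lt(html)
  html.gsub(/<([^<>]*)(?=<|$)/, '&lt;\1')
end

row = "<tr>\n <th>Total Weight</th>\n <td>< 1 g</td>\n</tr>"
puts escape_stray_lt(row)
```

Well-formed tags are left alone, and only the cell content changes to `&lt; 1 g`, which can then be passed to Nokogiri::HTML as usual.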

For a full test and usage example, see the gist: https://gist.github.com/796571

I hope it helps.


Source: https://habr.com/ru/post/1755048/