
Ruby Nokogiri SAX parser truncates strings at &gt; (aka ">")

Background: I am using the Ruby Nokogiri gem to parse an XML file. The problem I am facing is that the SAX parser returns an incomplete result when a string contains &gt;, which is the HTML encoding for >. For instance:

<element>PART1PART2</element> #=> returns "PART1PART2"
<element>PART3&gt;PART4</element> #=> returns "PART3"

My parser is as follows:

require 'nokogiri'
class MySample < Nokogiri::XML::SAX::Document
  def characters(string)
    puts string
  end
end
# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(MySample.new)
# Feed the parser some XML
parser.parse_file(ARGV[0])

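Goal: I would like the > to come through, ideally unchanged. I cannot replace > in the XML itself, so I need Nokogiri to handle the HTML encoding (&gt; to >) correctly.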

Question: How can I get Nokogiri to handle the HTML-encoded &gt; correctly, without truncating the string?


1- (FWIW)

, , . , , . , , SAX, DOM-.

:

  • Nokogiri v1.6.1. ( ) - v1.6.6, .

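  • (See matt's comment.) The characters callback can be called more than once for a single text node (for example, once for the text before &gt;, once for &gt; itself, and so on).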

  • Ruby Ox , , Nokogiri. , , &gt;. , , >. , Nokogiri ( ).

:

Nokogiri, Ox . , ( , ). , Ox , &gt; / >.


This may not need SAX at all. With the Nokogiri DOM parser:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<root>
  <element>PART1PART2</element>
  <element>PART3&gt;PART4</element>
</root>
EOT

puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <root>
# >>   <element>PART1PART2</element>
# >>   <element>PART3&gt;PART4</element>
# >> </root>

, .


Source: https://habr.com/ru/post/1535986/

