Incompatible ruby and Nokogiri HTML encodings

Question

I am parsing an external HTML page using Nokogiri. The page is encoded in ISO-8859-1. Some of the data I want to extract contains &#150; (dash) HTML entities:

    xml = Nokogiri.HTML(open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
    f = xml.xpath("//div[@style='background-color:#D9DBD9; padding:15px 12px 10px 10px;']//div[@class='tit_inter_cnz']/text()")
    f[0].text #=> Preview M/E/C/A \u0096 John Digweed

In the last line, the text should be displayed in the browser with a dash. The browser displays it correctly if I specify my page as ISO-8859-1 encoded, but my Sinatra application uses UTF-8. How can I display this text correctly in a browser? Currently it is displayed as a square with a small number inside. I tried force_encoding('ISO-8859-1'), but then I got a CompatibilityError from Sinatra.

Any clues?

[Edit] The following are screenshots of the application:

-> Firefox with UTF-8 character encoding

-> Firefox with Western (ISO-8859-1) character encoding

It is worth mentioning that in ISO-8859-1 mode above the dash is displayed correctly, but another incorrect character appears immediately before it. Strange :(

+4

ruby encoding nokogiri

Felipe Lima Jan 28 '11 at 18:26

3 answers

After parsing the document in Nokogiri, you can tell it to assume a different encoding. Try:

    require 'open-uri'
    require 'nokogiri'

    doc = Nokogiri::HTML(open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
    doc.encoding = 'UTF-8'

I can't reach that page from here to confirm that this fixes the problem, but it has worked for similar problems.

+9

the tin man Jan 28 '11 at 19:26

I work in scientific manuscript publishing, where there are many kinds of dashes. The dash you are using is not an ASCII character; it is a Unicode dash. Forcing the ISO encoding is likely to change the dash.

http://www.fileformat.info/info/unicode/char/96/index.htm

This site is great for unicode issues.

The reason you get a square is that your browser may not support this character; the data itself is probably correct. I would keep the UTF-8 encoding, and if you want everyone to be able to see the dash, convert it to an ASCII hyphen.

You can try Iconv to convert the characters to ASCII / UTF-8: http://craigjolicoeur.com/blog/ruby-iconv-to-the-rescue
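Iconv was removed from Ruby's standard library in Ruby 2.0, so on current Rubies roughly the same idea can be sketched with String#encode and its :fallback option. The helper name to_ascii_dash and the choice of a plain hyphen as the replacement are assumptions made for illustration only.

```ruby
# Hypothetical helper: transcode to US-ASCII and substitute a plain
# hyphen for the problem character (U+0096), which has no ASCII form.
def to_ascii_dash(text)
  text.encode('US-ASCII', fallback: { "\u0096" => '-' })
end

title = "Preview M/E/C/A \u0096 John Digweed"
puts to_ascii_dash(title)
```

With a Hash as the fallback, any other non-ASCII character that is not listed still raises Encoding::UndefinedConversionError, so unexpected characters are not silently dropped.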

0

Michael Papile Jan 28 '11 at 18:49

Phrogz · Accepted Answer · 2011-01-28T20:30:46+0000

Summary: The problem characters are control characters from ISO-8859-1, which are not intended to be displayed.

Details and investigation:
Here is a test showing that you are getting valid UTF-8 from Nokogiri and Sinatra:

    require 'sinatra'
    require 'open-uri'

    get '/' do
      html = open("http://flybynight.com.br/agenda.php").read
      p [ html.encoding, html.valid_encoding? ]
      #=> [#<Encoding:ISO-8859-1>, true]

      str = html[ /Preview.+?John Digweed/ ]
      p [ str, str.encoding, str.valid_encoding? ]
      #=> ["Preview M/E/C/A \x96 John Digweed", #<Encoding:ISO-8859-1>, true]

      utf8 = str.encode('UTF-8')
      p [ utf8, utf8.encoding, utf8.valid_encoding? ]
      #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]

      require 'nokogiri'
      doc = Nokogiri.HTML(html, nil, 'ISO-8859-1')
      p doc.encoding
      #=> "ISO-8859-1"

      dig = doc.xpath("//div[@class='tit_inter_cnz']")[1]
      p [ dig.text, dig.text.encoding, dig.text.valid_encoding? ]
      #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]

      <<-ENDHTML
        <!DOCTYPE html>
        <html><head><title>Dig it!</title></head><body>
        <p>Here it comes...</p>
        <p>#{dig.text}</p>
        </body></html>
      ENDHTML
    end

This serves the content correctly with Content-Type: text/html;charset=utf-8 on my computer. However, Chrome does not show me this character in the browser.

Analyzing the response, the same pair of UTF-8 bytes comes back for the dash as seen above: \xC2\x96. This appears to be the Unicode character U+0096, which looks like an odd sort of dash.
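The byte-level behaviour can be reproduced without the network, Sinatra or Nokogiri. The following is a minimal sketch using only core String methods; the variable names are illustrative.

```ruby
# The single byte 0x96, labelled as ISO-8859-1 (as in the fetched page).
raw_dash = "\x96".dup.force_encoding('ISO-8859-1')

# Transcoding maps byte 0x96 to code point U+0096, which UTF-8
# stores as the two bytes 0xC2 0x96.
utf8_dash = raw_dash.encode('UTF-8')

p raw_dash.bytes             #=> [150]       (0x96)
p utf8_dash.bytes            #=> [194, 150]  (0xC2 0x96)
p utf8_dash.valid_encoding?  #=> true
p utf8_dash == "\u0096"      #=> true
```

So the string is perfectly valid UTF-8; the trouble is only that U+0096 is not a printable character.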

I would chalk this up to bad source data and just put:

 #encoding: UTF-8

at the top of your Ruby source file(s), and then use:

 f = ...text.gsub( "\xC2\x96", "-" ) # Or a better Unicode character

Edit: If you look at the browser test page for this character, you will see (at least in Chrome and Firefox) that the literal UTF-8 version is blank, while the hexadecimal and decimal escape versions show up. I can't understand why this is so, but there you have it. Browsers simply do not display your character correctly when it is presented in raw form.

Either make it an HTML entity, or use a different Unicode dash. Either way, a gsub is called for.
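Both options can be sketched as follows. The constant and method names are hypothetical, and picking the en dash (U+2013, decimal 8211) as the replacement is an assumption about what the page author intended.

```ruby
# U+0096 is the character that arrives as the bytes "\xC2\x96" in UTF-8.
PROBLEM_CHAR = "\u0096"

# Option 1: swap in a numeric HTML entity for the en dash.
def dash_as_entity(text)
  text.gsub(PROBLEM_CHAR, '&#8211;')
end

# Option 2: swap in a real Unicode en dash (U+2013).
def dash_as_unicode(text)
  text.gsub(PROBLEM_CHAR, "\u2013")
end

title = "Preview M/E/C/A \u0096 John Digweed"
puts dash_as_entity(title)
puts dash_as_unicode(title)
```

Option 1 only works if the result is written into the page without further HTML escaping; otherwise the ampersand would be escaped and the entity would show up literally.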

Edit #2: One more odd note: the character in the source encoding has a hexadecimal byte value of 0x96. As far as I can tell, this is not a printable ISO-8859-1 character. As shown in the official specification for ISO-8859-1, it falls in one of the two non-printing regions.
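One possible explanation, which is an assumption and not something established above, is that the page was really written in Windows-1252 and merely labelled ISO-8859-1. In Windows-1252 the byte 0x96 is the en dash (U+2013), and browsers commonly decode pages labelled ISO-8859-1 as Windows-1252. A minimal sketch of reinterpreting the raw bytes under that assumption:

```ruby
# Assumption: the source bytes are really Windows-1252, not ISO-8859-1.
raw = "Preview M/E/C/A \x96 John Digweed".dup.force_encoding('Windows-1252')

# Transcoding from Windows-1252 turns byte 0x96 into U+2013 (en dash).
fixed = raw.encode('UTF-8')

puts fixed
p fixed.include?("\u2013")  #=> true
```

If that assumption holds for the page in question, passing 'Windows-1252' instead of 'ISO-8859-1' when decoding would avoid the gsub entirely; this has not been checked against the actual page.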
