Why does Iconv work in irb but not in the Ruby interpreter?

Question

I need to convert Latin characters, such as éáéíóúÀÉÍÓÚ, into a string of similar characters without accents or weird characters:

    é -> e
    è -> e
    Ä -> A

I have a file called "test.rb":

    require 'iconv'
    puts Iconv.iconv("ASCII//translit", "utf-8", 'è').join

When I insert these lines in irb, it works by returning "e" as expected.

But when I run the script:

 $ ruby test.rb

I get "?" as the output.

I am using irb 0.9.5 (05/04/13) and Ruby 1.8.7 (2011-06-30 patchlevel 352) [i386-linux].

ruby iconv irb

zambotn Dec 9 '11 at 12:55

1 answer

the tin man · Accepted Answer · 2011-12-09T15:47:42+0000

Ruby 1.8.7 was not multibyte-character aware the way 1.9+ is. In general, it treats a string as a sequence of bytes, not characters. If you need better handling of such characters, consider upgrading to 1.9+.
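To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes Ruby 1.9 or later and a UTF-8 source file, and `describe_string` is a hypothetical helper introduced only for illustration. It shows that "é" is one character but two bytes, 195 and 169, which is the same pair that 1.8.7 prints as `\303\251` in octal.

```ruby
# encoding: UTF-8
# Sketch only: assumes Ruby 1.9+ and a UTF-8 source file.
# describe_string is a hypothetical helper, not part of any library.

def describe_string(str)
  {
    characters:  str.length,     # character count (the 1.9+ view)
    bytes:       str.bytesize,   # byte count (roughly the 1.8 view)
    byte_values: str.bytes.to_a  # the raw bytes as integers
  }
end

info = describe_string("é")
puts info.inspect
```

On 1.8.7, `"é".length` would report the byte count instead, which is why code that works per "character" there ends up working per byte.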

James Gray has published a series of articles about working with multibyte characters in Ruby 1.8. I highly recommend taking the time to read them. It is a tricky subject, so you'll want to read the entire series a couple of times.

In addition, 1.8 needs the $KCODE flag set in order to handle the encoding:

 $KCODE = "U"

so you need to add that line to your code when running on 1.8.

Here is some sample code:

    #encoding: UTF-8

    require 'rubygems'
    require 'iconv'

    chars = "éáéíóúÀÉÍÓÚ"

    puts Iconv.iconv("ASCII//translit", "utf-8", chars)
    puts chars.split('')
    puts chars.split('').join

Using ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-darwin10.7.0] and running it in IRB, I get:

    1.8.7 :001 > #encoding: UTF-8
    1.8.7 :002 >
    1.8.7 :003 > require 'iconv'
    true
    1.8.7 :004 >
    1.8.7 :005 > chars = "\303\251\303\241\303\251\303\255\303\263\303\272\303\200\303\211\303\215\303\223\303\232"
    "\303\251\303\241\303\251\303\255\303\263\303\272\303\200\303\211\303\215\303\223\303\232"
    1.8.7 :006 >
    1.8.7 :007 > puts Iconv.iconv("ASCII//translit", "utf-8", chars)
    'e'a'e'i'o'u`A'E'I'O'U
    nil
    1.8.7 :008 >
    1.8.7 :009 > puts chars.split('')
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    ?
    nil
    1.8.7 :010 > puts chars.split('').join
    éáéíóúÀÉÍÓÚ

On line 9 of the output, I told Ruby to split the string into its concept of characters, which in 1.8.7 was bytes. The resulting '?' characters mean that it did not know what to do with the output. On line 10, I told it to split, which resulted in an array of bytes, which join then reassembled into a normal string, allowing the multibyte characters to display normally.
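The split-then-join behavior can be imitated on a newer Ruby. The following is a rough sketch, assuming Ruby 2.0+ (for `String#b`); it only mimics the 1.8.7 byte-oriented view and is not 1.8.7 itself. The helpers `split_into_bytes` and `rejoin_as_utf8` are hypothetical names made up for this illustration.

```ruby
# encoding: UTF-8
# Sketch only: assumes Ruby 2.0+ (String#b). It imitates the 1.8.7
# byte-oriented view of a string on a newer Ruby.

# Hypothetical helper: treat the string as raw bytes and split it
# into one-byte pieces, as 1.8.7's split('') effectively did.
def split_into_bytes(str)
  str.b.split('')
end

# Hypothetical helper: glue the pieces back together and label the
# result as UTF-8 again.
def rejoin_as_utf8(pieces)
  pieces.join.force_encoding('UTF-8')
end

chars  = "éáéíóúÀÉÍÓÚ"
pieces = split_into_bytes(chars)

puts pieces.length                                # 22: eleven characters, two bytes each
puts pieces.all? { |piece| piece.bytesize == 1 }  # every piece is half of a character
puts rejoin_as_utf8(pieces) == chars              # no bytes were lost, so the text survives
```

Each one-byte piece is an incomplete UTF-8 sequence on its own, which is why printing the pieces separately shows '?', while the joined string prints correctly.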

Running the same code using Ruby 1.9.2 shows better, more expected behavior:

    1.9.2p290 :001 > #encoding: UTF-8
    1.9.2p290 :002 >
    1.9.2p290 :003 > require 'iconv'
    true
    1.9.2p290 :004 >
    1.9.2p290 :005 > chars = "éáéíóúÀÉÍÓÚ"
    "éáéíóúÀÉÍÓÚ"
    1.9.2p290 :006 >
    1.9.2p290 :007 > puts Iconv.iconv("ASCII//translit", "utf-8", chars)
    'e'a'e'i'o'u`A'E'I'O'U
    nil
    1.9.2p290 :008 >
    1.9.2p290 :009 > puts chars.split('')
    é
    á
    é
    í
    ó
    ú
    À
    É
    Í
    Ó
    Ú
    nil
    1.9.2p290 :010 > puts chars.split('').join
    éáéíóúÀÉÍÓÚ

Ruby 1.9.2 kept the multibyte characters intact through split('').

Note that in both cases Iconv.iconv did the right thing: it created characters that were visually similar to the input characters. While the leading apostrophe looks out of place, it is there as a reminder that the character was originally accented.
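Iconv was removed from Ruby's standard library in 2.0, so the calls above need the separate iconv gem on newer versions. As an alternative technique (not what this answer uses), one possible sketch for Ruby 2.2+ is to decompose each character with Unicode normalization and drop the combining marks. `strip_accents` is a hypothetical helper name, and the approach only handles characters that decompose into a base letter plus accent; letters such as "ø" or "ß" pass through unchanged.

```ruby
# encoding: UTF-8
# Sketch only: assumes Ruby 2.2+ (String#unicode_normalize).
# strip_accents is a hypothetical helper, not a general Iconv replacement.

def strip_accents(str)
  # NFD splits "é" into "e" plus a combining acute accent (U+0301);
  # \p{Mn} matches such nonspacing combining marks, which are removed.
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

puts strip_accents("éáéíóúÀÉÍÓÚ")  # prints "eaeiouAEIOU"
puts strip_accents("è")            # prints "e"
```

Unlike Iconv's transliteration, this leaves no apostrophe or backtick behind as a marker of the removed accent.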

For more information, see the links on the right for related questions, or try this SO search for [ruby] [iconv]
