Ruby 1.8.7 is not multibyte-character aware the way 1.9+ is. In general, it treats a string as a sequence of bytes, not characters. If you need better handling of such characters, consider upgrading to 1.9+.
James Gray has published a series of articles about working with multibyte characters in Ruby 1.8. I highly recommend taking the time to read them. It's a complex subject, so you'll want to read the entire series a couple of times.
In addition, encoding support in 1.8 needs the $KCODE flag to be set:
$KCODE = "U"
so you need to add that line to any code running in 1.8.
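As a rough sketch of how that flag might be set without disturbing newer interpreters, the snippet below guards the assignment with a version check. The guard and the `letters` variable are my own illustration, not something the original code contained. It relies on the `/u` regex flag, which asks for UTF-8 matching on both 1.8 and 1.9+.

```ruby
# encoding: UTF-8

# Only Ruby 1.8 consults $KCODE, so set it only there.
$KCODE = "U" if RUBY_VERSION < "1.9"

chars = "éáéíóúÀÉÍÓÚ"

# The /u flag makes the regex UTF-8 aware, so each match is one character
# rather than one byte.
letters = chars.scan(/./u)

puts letters.length
puts letters.join(" ")
```

Scanning with a UTF-8 aware regex is one of the usual 1.8 workarounds for splitting a string into characters instead of bytes.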
Here is some sample code:
#encoding: UTF-8

require 'rubygems'
require 'iconv'

chars = "éáéíóúÀÉÍÓÚ"

puts Iconv.iconv("ASCII//translit", "utf-8", chars)
puts chars.split('')
puts chars.split('').join
Using ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-darwin10.7.0] and running it in IRB, I get:
1.8.7 :001 > #encoding: UTF-8
1.8.7 :002 >
1.8.7 :003 > require 'iconv'
true
1.8.7 :004 >
1.8.7 :005 > chars = "\303\251\303\241\303\251\303\255\303\263\303\272\303\200\303\211\303\215\303\223\303\232"
"\303\251\303\241\303\251\303\255\303\263\303\272\303\200\303\211\303\215\303\223\303\232"
1.8.7 :006 >
1.8.7 :007 > puts Iconv.iconv("ASCII//translit", "utf-8", chars)
'e'a'e'i'o'u`A'E'I'O'U
nil
1.8.7 :008 >
1.8.7 :009 > puts chars.split('')
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
nil
1.8.7 :010 > puts chars.split('').join
éáéíóúÀÉÍÓÚ
On line 9 of the output, I told Ruby to split the string into its concept of characters, which in 1.8.7 was bytes. The resulting '?' characters mean it did not know what to do with the output. On line 10, I told it to split, which produced an array of bytes, which join then reassembled into the normal string, allowing the multibyte characters to be displayed normally.
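To make the byte-versus-character distinction concrete, here is a minimal sketch that avoids split('') altogether. It assumes the string is valid UTF-8. The names `byte_count` and `letters` are mine, introduced only for illustration. String#unpack and Array#pack exist in both 1.8 and 1.9+, so this should behave the same way on either.

```ruby
# encoding: UTF-8

chars = "éáéíóúÀÉÍÓÚ"

# Each accented letter in this string takes two bytes in UTF-8.
byte_count = chars.unpack("C*").length

# unpack("U*") decodes the UTF-8 bytes into code points;
# pack("U") re-encodes each code point as a one-character string.
letters = chars.unpack("U*").map { |code_point| [code_point].pack("U") }

puts byte_count
puts letters.length
puts letters.join == chars
```

The byte count is twice the character count here, which is why a byte-wise split in 1.8.7 yields pieces that cannot be displayed on their own.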
Running the same code under Ruby 1.9.2 shows better, more expected behavior:
1.9.2p290 :001 > #encoding: UTF-8
1.9.2p290 :002 >
1.9.2p290 :003 > require 'iconv'
true
1.9.2p290 :004 >
1.9.2p290 :005 > chars = "éáéíóúÀÉÍÓÚ"
"éáéíóúÀÉÍÓÚ"
1.9.2p290 :006 >
1.9.2p290 :007 > puts Iconv.iconv("ASCII//translit", "utf-8", chars)
'e'a'e'i'o'u`A'E'I'O'U
nil
1.9.2p290 :008 >
1.9.2p290 :009 > puts chars.split('')
é
á
é
í
ó
ú
À
É
Í
Ó
Ú
nil
1.9.2p290 :010 > puts chars.split('').join
éáéíóúÀÉÍÓÚ
Ruby preserved the multibyte characters through the split('').
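A small sketch of why 1.9+ behaves this way: each string carries an encoding, so character counts and byte counts are tracked separately. The `raw_pieces` variable is hypothetical and only shows that relabeling a copy of the string as binary reproduces the 1.8-style byte split. This assumes Ruby 1.9 or later, since String#encoding, String#bytesize and String#force_encoding do not exist in 1.8.

```ruby
# encoding: UTF-8

chars = "éáéíóúÀÉÍÓÚ"

puts chars.encoding   # the encoding Ruby attached to the string
puts chars.length     # counts characters
puts chars.bytesize   # counts bytes

# Relabeling a copy as binary makes split('') fall back to single bytes,
# which mimics what 1.8.7 did.
raw_pieces = chars.dup.force_encoding("BINARY").split('')
puts raw_pieces.length
```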
Note that in both cases Iconv.iconv did the right thing: it created characters that were visually similar to the input characters. While the leading apostrophe looks out of place, it is there as a reminder that the character was originally accented.
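Iconv was removed from the standard library in Ruby 2.0 (it lives on as a gem), so on a newer Ruby a different technique is needed to get a similar effect. The sketch below is not what Iconv does internally. It swaps in Unicode normalization, assuming Ruby 2.2 or later where String#unicode_normalize is available, and it simply drops the accents instead of leaving an apostrophe marker. The `ascii` variable name is mine.

```ruby
# encoding: UTF-8

chars = "éáéíóúÀÉÍÓÚ"

# NFD splits each accented letter into a base letter plus a combining mark;
# \p{Mn} matches those nonspacing marks so gsub can remove them.
ascii = chars.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")

puts ascii
```

This only handles characters that decompose into a base letter plus combining marks, so it is narrower than the transliteration tables Iconv relies on.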
For more information, see the links on the right for related questions, or try this SO search for [ruby] [iconv]