Strange Behavior in Packed Ruby Strings

Question


I am puzzled by some Ruby behavior. Take a look at the following code:

[127].pack("C") == "\x7f" # => true

It makes sense. Now:

    [128].pack("C")           # => "\x80"
    "\x80"                    # => "\x80"
    [128].pack("C") == "\x80" # => false

The "C" directive of pack stands for an 8-bit unsigned integer (unsigned char), which should be able to hold the value 128. Also, both expressions print the same thing, so why are they not equal? Is this related to encoding?

I'm on ruby 2.0.0p247.

+6

string ruby encoding

lucas clemente Nov 14 '13 at 12:33


2 answers

In Ruby 1.9, the default source file encoding is US-ASCII. Starting with Ruby 2.0, the default changed to UTF-8. String literals normally take the encoding of the source file that contains them; in a US-ASCII source, a literal with a non-ASCII byte escape such as "\x80" becomes ASCII-8BIT instead.

However, the encoding of [128].pack("C") is ASCII-8BIT.

So [128].pack("C") == "\x80" is false in Ruby 2.0 and true in Ruby 1.9.

Putting #coding:some_encoding on the first line of the source file (or right after the shebang) changes the source encoding for that file.

    #coding:ascii
    puts([128].pack("C") == "\x80")

This prints true in Ruby 2.0.
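If changing the source encoding is not an option, another way to get a meaningful comparison is to bring both strings to the same encoding first. The following is a minimal sketch, assuming the file uses the default UTF-8 source encoding of Ruby 2.0 and later; the variable names are only illustrative.

```ruby
packed  = [128].pack("C") # ASCII-8BIT (binary) string
literal = "\x80"          # takes the source encoding (UTF-8 is assumed here)

# Direct comparison: the encodings are incompatible for this non-ASCII byte.
same_as_literal = (packed == literal)

# String#b (Ruby 2.0+) returns a copy of the string tagged as ASCII-8BIT.
same_as_binary = (packed == literal.b)

# Comparing the raw bytes ignores the encoding entirely.
same_bytes = (packed.bytes == literal.bytes)

puts same_as_literal # false
puts same_as_binary  # true
puts same_bytes      # true
```

Which variant fits best depends on whether the data is really binary (then tagging the literal as binary is the natural choice) or text that just happens to arrive from pack.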

+1

Yu Hao Nov 14 '13 at 12:56


tessi · Accepted Answer · 2013-11-14T12:43:07+0000

The comparison is false because the encodings differ:

    [128].pack("C").encoding #=> #<Encoding:ASCII-8BIT>
    "\x80".encoding          #=> #<Encoding:UTF-8>

(using ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-linux] )

In Ruby 2.0, the default encoding for string literals is UTF-8, but pack("C") returns an ASCII-8BIT (binary) string.

Why is `[127].pack('C') == "\x7f"` true then?

[127].pack('C') == "\x7f" is true because ASCII and UTF-8 are identical for byte values 0 to 127. The check happens when Ruby compares the two strings (see the Rubinius source code):

    def ==(other)
      [...]
      return false unless @num_bytes == other.bytesize
      return false unless Encoding.compatible?(self, other)
      return @data.compare_bytes(other.__data__, @num_bytes, other.bytesize) == 0
    end

(The MRI C source is similar, but harder to understand.)

We observe that the comparison checks for compatible encoding. Try the following:

    Encoding.compatible?([127].pack("C"), "\x7f") #=> #<Encoding:ASCII-8BIT>
    Encoding.compatible?([128].pack("C"), "\x80") #=> nil

We see that starting at byte value 128, the comparison returns false, even when both strings consist of the same bytes.
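To see where the boundary lies, one can compare every single-byte packed string with a UTF-8 tagged string holding the same byte. This is only an illustrative sketch; the names results and first_unequal are made up for the example.

```ruby
# For each byte value, build the packed (ASCII-8BIT) string and a copy of the
# same byte tagged as UTF-8, then record whether == considers them equal.
results = (0..255).map do |byte|
  packed = [byte].pack("C")
  utf8   = packed.dup.force_encoding(Encoding::UTF_8)
  [byte, packed == utf8]
end

# The first byte value for which the two strings are no longer equal.
first_unequal = results.find { |_, equal| !equal }.first

puts first_unequal # 128
```

Strings that contain only bytes below 128 are ASCII-only, so Ruby treats them as comparable across ASCII-compatible encodings; from 128 upward the encodings have to match.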
