In Ruby 1.9, each stream has two encodings associated with it: an external encoding and an internal encoding. The external encoding is the encoding of the text as it is stored in the stream (in your case, the encoding of the file). The internal encoding is the encoding you want the text to be converted to when it is read from the stream.
If you do not set the external or internal encoding for a stream, the process-wide defaults (Encoding::default_external and Encoding::default_internal) are used. If no internal encoding is specified, strings read from the stream are only tagged with the external encoding, not converted (the same effect as String#force_encoding).
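The "tagged, not converted" behaviour can be checked with a small sketch. The temporary file and the three sample bytes are made up for illustration, and the sketch assumes Encoding.default_internal is left at its default of nil:

```ruby
require "tempfile"

# Three bytes that spell a short Cyrillic word in Windows-1251
# (sample data chosen for this sketch).
raw_bytes = [0xEF, 0xF0, 0xE8].pack("C*") # binary (ASCII-8BIT) string

# Hypothetical demo file holding exactly those bytes.
file = Tempfile.new("encoding_demo")
file.binmode
file.write(raw_bytes)
file.close

# Only an external encoding is given, so the text is tagged, not converted.
tagged = File.open(file.path, "r:windows-1251") { |f| f.read }

# force_encoding relabels a copy of the raw bytes in the same way.
forced = raw_bytes.dup.force_encoding("windows-1251")

same_bytes  = (tagged.bytes == raw_bytes.bytes) # the bytes are untouched
same_string = (tagged == forced)                # same bytes, same encoding tag

file.unlink
```

Both comparisons should come out true here, since reading with only an external encoding never rewrites the bytes.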
Most likely your Encoding::default_external is UTF-8, while your file is encoded in some extended-ASCII encoding rather than UTF-8. Ruby therefore reads the raw bytes from the file and tags them as a UTF-8 string without checking them. Since your file contains non-ISO extended-ASCII text, those bytes are not valid UTF-8, and you get data.valid_encoding? # => false.
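As a rough illustration of why the check fails, the sketch below labels a few Windows-1251 bytes as UTF-8 and then relabels them correctly. The byte values are sample data chosen for this example, not taken from your file:

```ruby
# Windows-1251 bytes wrongly labelled as UTF-8, which is what happens when
# the default external encoding does not match the file.
data = [0xEF, 0xF0, 0xE8].pack("C*").force_encoding("UTF-8")
invalid_as_utf8 = !data.valid_encoding? # 0xEF must be followed by continuation bytes

# Relabelling with the real encoding makes the same bytes valid.
fixed = data.dup.force_encoding("windows-1251")
valid_as_cp1251 = fixed.valid_encoding?

# An explicit conversion then yields a proper UTF-8 string.
utf8 = fixed.encode("UTF-8")
```

Note that force_encoding never changes the bytes; only the final encode call actually transcodes them.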
You need to set the external encoding of your stream to the encoding of the file. For example, if you have a cp1251-encoded file, you need to read it with the following code:
    # Tag the text with its real encoding; the bytes are left as they are.
    data = File.open("test.txt", 'r:windows-1251').read
    puts data.encoding.name    # => Windows-1251
    puts data.valid_encoding?  # => true
or even specify both the external and the internal encoding (in that order):
    # Read Windows-1251 bytes and convert them to UTF-8 while reading.
    data = File.open("test.txt", 'r:windows-1251:utf-8').read
    puts data.encoding.name    # => UTF-8
    puts data.valid_encoding?  # => true