Change character encoding

I'm having trouble changing the encoding of a text file in Ruby 1.9.2p290. I get an "invalid byte sequence in UTF-8 (ArgumentError)" error. The problem (I think) is that the file's encoding is unknown.

From the command line if I do the following:

$ file test.txt 

I get:

 Non-ISO extended-ASCII English text, with CRLF line terminators 

Or, alternatively, if I do this:

 $ file -i test.txt 

I get:

 test.txt: text/plain; charset=unknown 

However, in Ruby, if I do this:

 data = File.open("test.txt").read
 puts data.encoding.name
 puts data.valid_encoding?

I get:

 UTF-8
 false

Here's a simplified snippet of code:

 data = File.open("test.txt").read
 data.encode!("UTF-8")
 data.each_line do |line|
   newfile_data << line
 end
2 answers

In Ruby 1.9, each IO stream has two encodings associated with it: an external encoding and an internal encoding. The external encoding is the encoding of the text as it is stored in the stream (in your case, the encoding of the file). The internal encoding is the encoding you want the text to have once it has been read from the stream.

If you do not set the external or internal encoding for a stream, the process-wide default external and internal encodings are used. If no internal encoding is set, the string read from the stream is only tagged with the external encoding, not converted (the same thing String#force_encoding does).
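To see which encodings a stream ends up with, you can ask the stream itself. The following is a minimal sketch, not part of the original answer: the helper name `stream_encodings` and the scratch file are assumptions made up for illustration, while `IO#external_encoding` and `IO#internal_encoding` are standard Ruby methods.

```ruby
require "tempfile"

# Returns the [external, internal] encoding names of a stream opened with
# the given mode string (nil means "not set").
# `stream_encodings` is a hypothetical helper, not a Ruby built-in.
def stream_encodings(path, mode)
  File.open(path, mode) do |f|
    [f.external_encoding, f.internal_encoding].map { |enc| enc && enc.name }
  end
end

sample = Tempfile.new("encoding-demo") # hypothetical scratch file
sample.close

p stream_encodings(sample.path, "r")                    # falls back to the process defaults
p stream_encodings(sample.path, "r:windows-1251")       # external encoding only
p stream_encodings(sample.path, "r:windows-1251:utf-8") # external and internal

sample.unlink
```

The first line of output depends on your environment, since it reflects Encoding.default_external and Encoding.default_internal.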

Most likely you have

 Encoding::default_external # => #<Encoding:UTF-8>
 Encoding::default_internal # => nil

And your file is encoded in some extended ASCII encoding, not in UTF-8. Your Ruby code reads the byte sequence from the external source into a string tagged as UTF-8. Since the file contains "Non-ISO extended-ASCII English text", some of those bytes are not valid UTF-8, so you get data.valid_encoding? # => false.
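The effect can be reproduced without a file. This is a small sketch under assumptions: the sample bytes ("café" with é stored as the single byte 0xE9, as in windows-1252) and the helper names are invented for illustration. It also suggests why the data.encode!("UTF-8") call in the question does not help: the string is already tagged as UTF-8, so there is nothing to convert.

```ruby
# "caf" followed by 0xE9 ("é" in windows-1252), tagged as UTF-8 without
# any conversion. The sample bytes are a made-up example.
def invalid_utf8_sample
  [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding("UTF-8")
end

# Runs the block and returns the ArgumentError message, or nil if none was raised.
# `error_message_for` is a hypothetical helper.
def error_message_for
  yield
  nil
rescue ArgumentError => e
  e.message
end

sample = invalid_utf8_sample
puts sample.valid_encoding?                 # false
puts sample.encode("UTF-8").valid_encoding? # false: UTF-8 to UTF-8 converts nothing
puts error_message_for { sample =~ /f/ }    # invalid byte sequence in UTF-8
```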

You need to set the external encoding of your stream to the actual encoding of the file. For example, if your file is encoded in CP1251 (windows-1251), you need to read it with the following code:

 data = File.open("test.txt", 'r:windows-1251').read
 puts data.encoding.name # => Windows-1251
 puts data.valid_encoding? # => true

Or specify both the external and the internal encoding (in that order), so that the text is converted to UTF-8 as it is read:

 data = File.open("test.txt", 'r:windows-1251:utf-8').read
 puts data.encoding.name # => UTF-8
 puts data.valid_encoding? # => true
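If you want to try this without a real windows-1251 file at hand, a round trip through a temporary file works. This is only a sketch: the constant `CP1251_BYTES`, the helper `read_sample`, and the sample word are assumptions chosen for illustration.

```ruby
require "tempfile"

# The word "Привет" as raw windows-1251 bytes (hypothetical sample data).
CP1251_BYTES = [0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2].pack("C*")

# Writes raw bytes to a temp file and reads them back with the given mode string.
# `read_sample` is a hypothetical helper.
def read_sample(bytes, mode)
  file = Tempfile.new("encoding-demo")
  file.binmode
  file.write(bytes)
  file.close
  File.open(file.path, mode) { |f| f.read }
ensure
  file.unlink if file
end

text = read_sample(CP1251_BYTES, "r:windows-1251:utf-8")
puts text.encoding.name                             # UTF-8
puts text.valid_encoding?                           # true
puts text == "\u041F\u0440\u0438\u0432\u0435\u0442" # true
```

Passing the block form of File.open, as the helper does, also closes the file handle, which the one-line File.open(...).read form leaves to the garbage collector.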
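Alternatively, read the file as windows-1252, then convert it to UTF-8 and normalize the CRLF line endings:

 data = IO.read("test.txt", :encoding => 'windows-1252')
 data = data.encode("UTF-8").gsub("\r\n", "\n")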
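The same approach can be wrapped in a small helper. This is a sketch, assuming the file really is windows-1252 (the `file` output in the question does not confirm that); `read_as_utf8`, `SAMPLE_BYTES`, and the scratch file are names invented for the example.

```ruby
require "tempfile"

# Reads `path` as `source_encoding` and returns UTF-8 text with LF line endings.
# `read_as_utf8` is a hypothetical helper built on the two lines above.
def read_as_utf8(path, source_encoding)
  data = File.read(path, :encoding => source_encoding)
  data.encode("UTF-8").gsub("\r\n", "\n")
end

# Hypothetical sample: "café" and "naïve" as windows-1252 bytes, CRLF-terminated.
SAMPLE_BYTES = [0x63, 0x61, 0x66, 0xE9, 0x0D, 0x0A,
                0x6E, 0x61, 0xEF, 0x76, 0x65, 0x0D, 0x0A].pack("C*")

file = Tempfile.new("crlf-demo")
file.binmode
file.write(SAMPLE_BYTES)
file.close

text = read_as_utf8(file.path, "windows-1252")
puts text.encoding.name   # UTF-8
puts text.valid_encoding? # true
puts text.include?("\r")  # false

file.unlink
```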

Source: https://habr.com/ru/post/1387645/

