Avoiding tripping over the UTF-8 BOM when reading files

I am consuming a data feed that recently added a Unicode BOM header (U+FEFF), and my rake task is now confused by it.

I can skip the first 3 bytes with file.gets[3..-1], but is there a more elegant way to read files in Ruby that handles this correctly, whether or not the BOM is present?

+29
ruby file unicode byte-order-mark
Feb 12 '09 at 20:59
3 answers

With Ruby 1.9.2 you can use the mode r:bom|utf-8:

 text_without_bom = nil # define the variable outside the block to keep the data
 File.open('file.txt', 'r:bom|utf-8') { |file| text_without_bom = file.read }

or

 text_without_bom = File.read('file.txt', encoding: 'bom|utf-8') 

or

 text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8') 

It does not matter whether the BOM is present in the file or not.
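A quick way to convince yourself that this mode is BOM-agnostic (a minimal sketch; the temp-file setup and file names are just for illustration):

```ruby
require 'tmpdir'

dir        = Dir.mktmpdir
path_bom   = File.join(dir, 'with_bom.txt')
path_plain = File.join(dir, 'without_bom.txt')

# Same content, once with a leading UTF-8 BOM and once without.
File.binwrite(path_bom,   "\xEF\xBB\xBFhello\n")
File.binwrite(path_plain, "hello\n")

a = File.read(path_bom,   mode: 'r:bom|utf-8')
b = File.read(path_plain, mode: 'r:bom|utf-8')
puts a == b  # true: the BOM was stripped transparently
```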




You can also use the encoding option with other commands:

 text_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')

(You get an array with all the lines.)

Or with CSV:

 require 'csv'
 CSV.open(@filename, 'r:bom|utf-8') { |csv| csv.each { |row| p row } }
+47
Oct. 15 '11 at 20:48

I would not blindly skip the first three bytes; what if the producer stops adding the BOM again? What you should do is examine the first few bytes, and if they are 0xEF 0xBB 0xBF, ignore them. That is the form the BOM character (U+FEFF) takes in UTF-8; I prefer to deal with it before trying to decode the stream, because BOM handling is so inconsistent from one language/tool/framework to the next.
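The manual approach could be sketched like this (read_without_bom is a hypothetical helper name, not a library method):

```ruby
# Peek at the first three bytes and drop them only if they are the
# UTF-8 BOM (0xEF 0xBB 0xBF); otherwise leave the data untouched.
UTF8_BOM = "\xEF\xBB\xBF".b

def read_without_bom(path)
  data = File.binread(path)
  data = data[3..-1] if data.start_with?(UTF8_BOM)
  data.force_encoding('UTF-8')
end
```

This works the same whether or not the producer keeps sending the BOM, which is the point of checking instead of blindly slicing.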

In fact, that is how you are supposed to deal with a BOM. If the file was served as UTF-16, you have to examine the first two bytes before decoding so you know whether to read it as big-endian or little-endian. Of course, the UTF-8 BOM has nothing to do with byte order; it is only there to let you know that the encoding is UTF-8, in case you did not already know.

+10
Feb 13 '09 at 15:04

I would not "trust" a file to be encoded as UTF-8 just because the BOM 0xEF 0xBB 0xBF is present; that can fail. Usually, when you detect a UTF-8 BOM, the file really is UTF-8 encoded, of course. But if, for example, someone has simply prepended the UTF-8 BOM to an ISO-8859-1 file, decoding will fail badly if the file contains bytes above 0x7F. You can only trust the file if it contains nothing but bytes up to 0x7F, because in that case it is an ASCII file, which is compatible with UTF-8 and at the same time a valid UTF-8 file.

If the file contains bytes above 0x7F (after the BOM), then to be sure it is correctly encoded as UTF-8 you need to verify that every multi-byte sequence is well formed, and, even if all sequences are well formed, also check that each code point uses the shortest possible sequence (no overlong encodings) and that no code point falls in the high- or low-surrogate range. Also check that no sequence is longer than 4 bytes and that the highest code point is 0x10FFFF; that upper limit additionally restricts the first byte of a 4-byte sequence to at most 0xF4 and, when it is 0xF4, the following byte to at most 0x8F. If all of these checks pass, your UTF-8 BOM is telling the truth.
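In Ruby you do not have to hand-roll those checks: String#valid_encoding? performs essentially the validation described above (well-formed sequences, no overlong forms, no surrogates, nothing above U+10FFFF). A minimal sketch, with trustworthy_utf8? as a hypothetical helper name:

```ruby
# Returns true only if the raw bytes form valid UTF-8 —
# i.e. the BOM, if present, is "telling the truth".
def trustworthy_utf8?(bytes)
  bytes.dup.force_encoding('UTF-8').valid_encoding?
end

trustworthy_utf8?("\xEF\xBB\xBFcaf\xC3\xA9".b)  # true: well-formed UTF-8
trustworthy_utf8?("\xEF\xBB\xBF\xE9caf".b)      # false: bare ISO-8859-1 byte after the BOM
```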

0
Jun 03 '13 at 15:05


