tl;dr summary: Given a stream of bytes representing a string in an unknown encoding, which encodings should I try, and in what order, to have the best chance of finding the 'correct' encoding?
Problem example
I have an arrows.txt file which, as far as I know, was saved using UTF-8 and whose single-character content is ⇈. If I pretend I don't know what the encoding of this file is, the following Ruby for Windows code does not work:
```ruby
s = IO.read('arrows.txt')
p s.encoding,        #=> #<Encoding:IBM437>
  s.valid_encoding?, #=> true
  s.chars.to_a       #=> ["\xE2", "\x87", "\x88"]
```
It doesn't work: it tells me that the file's contents were Γçê and that everything is fine (the encoding is valid), when in fact the interpretation is wrong.
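For contrast, if I already knew the file was UTF-8, I could relabel the same three bytes and recover the intended character (a minimal sketch using the byte values from the output above):

```ruby
# The three bytes Ruby showed above, tagged as binary data.
bytes = "\xE2\x87\x88".b

# force_encoding relabels the string without converting any bytes.
s = bytes.force_encoding('UTF-8')
puts s                  #=> ⇈
puts s.valid_encoding?  #=> true
```

The key point is that `force_encoding` only changes the label on the bytes; it is the label that determines whether Ruby considers the string valid.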
Real world scenario
I have Nginx log files and Akamai log files that do not declare any specific encoding for the requests they record, and I need to process them and store the data in a database as UTF-8. In most cases, interpreting each line as UTF-8 produces a string with a valid encoding, but sometimes it does not.
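A minimal per-line sketch of that situation, using two hypothetical request lines (falling back to ISO-8859-1 is my assumption here, not something given in the logs):

```ruby
# Two hypothetical request lines: one valid UTF-8, one Latin-1 bytes.
lines = ["GET /caf\xC3\xA9 HTTP/1.1".b, "GET /caf\xE9 HTTP/1.1".b]

cleaned = lines.map do |raw|
  utf8 = raw.dup.force_encoding('UTF-8')
  if utf8.valid_encoding?
    utf8
  else
    # Fallback assumption: treat undecodable lines as ISO-8859-1,
    # which accepts every byte, then transcode to UTF-8.
    raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
  end
end
```

Both lines end up as the same valid UTF-8 string, but only because the fallback happened to be the right guess for the second line.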
I want to ask Ruby to try different encodings for each line and pick the one most likely (though of course not guaranteed) to be correct.
Unsuccessful attempt
I originally wrote the following code:
```ruby
def guess_encoding( str, result='utf-8', *encodings )
```
The problem is that the very first encoding in Encoding.list is ASCII-8BIT, which is valid for all byte streams. Thus, if I use my code above and call s2 = guess_encoding(s), the result is a binary (ASCII-8BIT) string of the three raw bytes of my double-arrow character above, not the single character I wanted.
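Since the body of the method is not shown above, here is a sketch of the idea with the obvious fix applied — skip the all-permissive binary encoding before testing candidates (the fallback behavior of returning the original string is my assumption):

```ruby
# Sketch: try candidate encodings in order and return the string
# transcoded to `result` for the first one that validates.
# ASCII-8BIT is skipped because it accepts every byte stream.
def guess_encoding(str, result = 'utf-8', *encodings)
  encodings = Encoding.list.map(&:name) if encodings.empty?
  encodings.each do |enc|
    next if enc.to_s == 'ASCII-8BIT'
    candidate = str.dup.force_encoding(enc)
    return candidate.encode(result) if candidate.valid_encoding?
  rescue Encoding::ConverterNotFoundError, Encoding::UndefinedConversionError
    next # some encodings cannot be transcoded to the target; skip them
  end
  str # assumption: give the original back if nothing validated
end
```

With the bytes from the example, `guess_encoding(s, 'utf-8', 'UTF-8', 'IBM437')` returns the double-arrow character, while putting IBM437 first still yields the wrong three-character string — which is exactly why the ordering question below matters.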
Finally, the question(s)
In what order should the encodings be checked to maximize the probability that the first one for which valid_encoding? returns true is the right one? Which common encodings are the strictest about the byte sequences they accept, so that I should try them first, and which common encodings are completely permissive, so that I should try them last?
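One plausible ordering — my assumption, based on how strict each encoding's byte grammar is, not a guaranteed recipe — is to try strict encodings first and byte-permissive single-byte encodings last:

```ruby
# Strictest first: US-ASCII and UTF-8 reject many byte sequences,
# so a false positive is unlikely; single-byte encodings such as
# ISO-8859-1 accept every byte, so they belong at the end.
CANDIDATES = %w[US-ASCII UTF-8 Shift_JIS EUC-JP ISO-8859-1 Windows-1252]

def first_valid(bytes, candidates = CANDIDATES)
  candidates.find { |enc| bytes.dup.force_encoding(enc).valid_encoding? }
end
```

For the three bytes from the example this picks UTF-8, because US-ASCII rejects bytes above 0x7F and UTF-8 happens to validate.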
Are there any other heuristics I could use to judge correctness? (For example, is a particular encoding more likely to be correct if it results in fewer characters than another?)
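The "fewer characters" idea can be sketched as a scoring pass over the candidates — a heuristic only, on the assumption that multi-byte sequences decoding to a single character hint that the encoding is right:

```ruby
# Heuristic sketch: among encodings that validate, prefer the one
# that yields the fewest characters.
def best_by_char_count(bytes, candidates)
  candidates
    .map    { |enc| bytes.dup.force_encoding(enc) }
    .select(&:valid_encoding?)
    .min_by { |s| s.chars.length }
end
```

For the example bytes, UTF-8 decodes them to one character while IBM437 decodes them to three, so this heuristic prefers UTF-8 — matching the intent of the original file.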