Guessing the string encodings for the byte stream in the log file

tl;dr summary: Given a stream of bytes representing a string in an unknown encoding, which encodings, and in what order, should I try in order to have the best chance of finding the 'correct' encoding?

Problem example

I have a file arrows.txt which, as far as I know, was saved using UTF-8 with the single-character content ⇈. If I pretend I don't know what the encoding of this file is, the following code (run under Ruby on Windows) does not work:

    s = IO.read('arrows.txt')
    p s.encoding,        #=> #<Encoding:IBM437>
      s.valid_encoding?, #=> true
      s.chars.to_a       #=> ["\xE2", "\x87", "\x88"]

It does not work in the sense that it tells me the file actually had the contents Γçê, and that everything is fine (the encoding is valid).

Real world scenario

I have Nginx log files and Akamai log files that record requests without any particular encoding, and I need to process them and store the data in a database as UTF-8. In most cases, interpreting each line as UTF-8 produces a line with a valid encoding, but sometimes it does not.
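A minimal sketch of that per-line check (the log file name and reading details are my assumptions, not from the post):

    # Read raw bytes so Ruby doesn't pick an external encoding for us,
    # then test each line as UTF-8.
    File.foreach('access.log', encoding: 'ASCII-8BIT') do |line|
      line.force_encoding(Encoding::UTF_8)
      next if line.valid_encoding? # most lines pass
      # ...otherwise we have to guess, which is the problem below
    end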

I want Ruby to try various encodings for each line and pick one that is probably (though, of course, not guaranteed to be) correct.

Unsuccessful attempt

I originally wrote the following code:

    def guess_encoding( str, result='utf-8', *encodings )
      # Try every encoding if none were passed in
      encodings = Encoding.list if encodings.empty?

      # Keep forcing a new encoding until we find one that is valid
      unless encodings.find{ |e| str.force_encoding(e) && str.valid_encoding? }
        raise "None of the supplied encodings was valid"
      end

      # Convert from the valid encoding to the desired one, replacing 'bad' characters
      str.encode(result, invalid: :replace, undef: :replace)
    end

The problem is that the very first encoding in Encoding.list is ASCII-8BIT, which is valid for every byte stream. So if I use the code above and call s2 = guess_encoding(s), the bytes are treated as binary, and the conversion to UTF-8 replaces each of them, yielding three replacement characters instead of my double-arrow character.
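One way around this (my sketch, not part of the original post) is to hand guess_encoding an explicitly ordered list, strict encodings first, accept-anything encodings last:

    # The ordering is an assumption: UTF-8 is strict, Windows-1252 rejects a
    # few undefined bytes, while ISO-8859-1 and ASCII-8BIT accept any byte
    # at all, so those two go to the very end.
    def ordered_encodings
      strict     = [Encoding::UTF_8, Encoding::Windows_1252]
      permissive = [Encoding::ISO_8859_1, Encoding::ASCII_8BIT]
      strict + (Encoding.list - strict - permissive) + permissive
    end

    s2 = guess_encoding(s, 'utf-8', *ordered_encodings)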

Finally, question(s)

In what order should the encodings be checked to maximize the chance that the first one to pass valid_encoding? is the right one? Which encodings are strictest about which byte sequences they accept, so that I should try them first, and which are completely permissive, so that I should try them last?

Are there any other heuristics I should use to judge correctness? (For example, is a valid encoding more likely to be correct if it produces fewer characters than another?)
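The fewer-characters heuristic is easy to prototype; here is a sketch under my own assumptions (rank_by_char_count is a hypothetical helper, not an existing API):

    # Among encodings that consider the bytes valid, prefer those that decode
    # to the fewest characters, i.e. that recognized multi-byte sequences.
    def rank_by_char_count(bytes, encodings)
      encodings.select { |e| bytes.dup.force_encoding(e).valid_encoding? }
               .sort_by { |e| bytes.dup.force_encoding(e).length }
    end

    rank_by_char_count("\xE2\x87\x88".b, [Encoding::UTF_8, Encoding::ISO_8859_1])
    #=> UTF-8 ranks first: it reads the three bytes as the single character "⇈"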

3 answers

You could try the rchardet19 gem. It "takes a sequence of bytes in an unknown character encoding and attempts to determine the encoding", and it also gives you a confidence score for the encoding it returns. It has worked for me several times in the past and seems to do what you are trying to accomplish.

Usage example:

    require 'rchardet19'
    cd = CharDet.detect("some data")
    # => #<struct #<Class:0x102216198> encoding="ascii", confidence=1.0>
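If detection succeeds, the reported name can usually be fed back into String#encode. A sketch (raw_line is a placeholder; detector names don't always map onto Ruby encoding names, so guard for that in real code):

    cd = CharDet.detect(raw_line)
    utf8 = raw_line.force_encoding(cd.encoding)
                   .encode('UTF-8', invalid: :replace, undef: :replace)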

If your code can run on a Unix/Linux machine, filemagic may work well for you.

 gem install ruby-filemagic 

This is most useful as a tool to determine the encoding of the entire file, which can then be used for all lines in the file. The following should help you get started with it:

    $ irb
    irb(main):001:0> require 'filemagic'
    => true
    irb(main):002:0> fm = FileMagic.new
    => #<FileMagic:0x7fd4afb0>
    irb(main):003:0> fm.file('afile.zip')
    => "Zip archive data, at least v2.0 to extract"
    irb(main):004:0>
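For the charset specifically, libmagic can be asked for MIME output rather than a description. A sketch using the gem's flag constant (the exact string returned depends on your libmagic version):

    require 'filemagic'
    fm = FileMagic.new(FileMagic::MAGIC_MIME)
    fm.file('arrows.txt') #=> e.g. "text/plain; charset=utf-8"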

When I was writing spiders, I always tried ISO-8859-1 first and then Windows-1252. The difference between them is small, so either should be suitable in most cases. My reasoning for trying those two first: they are the encodings you are most likely to come across.

If something didn't match either of those, I would just use iconv to convert it to UTF-8, or strip the diacritical marks so the text was visually close to what we expected to see, and carry on.
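That fallback is easy to express in today's Ruby; a sketch in my wording, not the answerer's code:

    # Keep the line if it is valid UTF-8; otherwise reinterpret it as
    # ISO-8859-1 (swap in Windows-1252 here if your traffic leans that way).
    # ISO-8859-1 defines all 256 byte values, so the last step cannot fail.
    def to_utf8(line)
      utf8 = line.dup.force_encoding(Encoding::UTF_8)
      return utf8 if utf8.valid_encoding?
      line.dup.force_encoding(Encoding::ISO_8859_1).encode('UTF-8')
    end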

There were times when nothing was a hit. I had code that retrieved all the iconv encodings, discarded all the ASCII values, and tried to find the encoding with the most hits among the remaining characters. XML and HTML were sometimes so mangled that nothing helped, and that was when I gave up and punted.
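A rough modern equivalent of that brute-force pass, using Encoding.list where the original used iconv (my reconstruction, not the answerer's code):

    # Score each encoding by how many non-ASCII characters it both accepts
    # and can convert to UTF-8; dummy encodings (e.g. bare UTF-16) are skipped.
    def most_hits_encoding(line)
      Encoding.list.reject(&:dummy?).max_by do |enc|
        candidate = line.dup.force_encoding(enc)
        next -1 unless candidate.valid_encoding?
        begin
          candidate.encode('UTF-8').chars.count { |c| c.ord > 127 }
        rescue Encoding::UndefinedConversionError, Encoding::ConverterNotFoundError
          -1
        end
      end
    end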


Source: https://habr.com/ru/post/1396258/

