Guessing the string encodings for the byte stream in the log file

tl;dr summary: Given a stream of bytes representing a string in an unknown encoding, which encodings, and in what order, should I try in order to have the best chance of finding the 'correct' encoding?

Problem example

I have a file arrows.txt which, as far as I know, was saved using UTF-8 with the single-character content ⇈. If I pretend I don't know what the encoding of this file is, the following code (run under Ruby on Windows) does not work:

    s = IO.read('arrows.txt')
    p s.encoding,        #=> #<Encoding:IBM437>
      s.valid_encoding?, #=> true
      s.chars.to_a       #=> ["\xE2", "\x87", "\x88"]

It does not work in the sense that it tells me the file actually had the contents Γçê, and that everything is fine (the encoding is valid).

Real world scenario

I have Nginx log files and Akamai log files that record requests without any particular encoding, and I need to process them and store the data in a database as UTF-8. In most cases, interpreting each line as UTF-8 produces a line with a valid encoding, but sometimes it does not.
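A minimal sketch of that per-line check (the log file name and reading details are my assumptions, not from the post):

    # Read raw bytes so Ruby doesn't pick an external encoding for us,
    # then test each line as UTF-8.
    File.foreach('access.log', encoding: 'ASCII-8BIT') do |line|
      line.force_encoding(Encoding::UTF_8)
      next if line.valid_encoding? # most lines pass
      # ...otherwise we have to guess, which is the problem below
    end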

I want Ruby to try various encodings for each line and pick one that is probably (though, of course, not guaranteed to be) correct.

Unsuccessful attempt

I originally wrote the following code:

    def guess_encoding( str, result='utf-8', *encodings )
      # Try every encoding if none were passed in
      encodings = Encoding.list if encodings.empty?

      # Keep forcing a new encoding until we find one that is valid
      unless encodings.find{ |e| str.force_encoding(e) && str.valid_encoding? }
        raise "None of the supplied encodings was valid"
      end

      # Convert from the valid encoding to the desired one, replacing 'bad' characters
      str.encode(result, invalid: :replace, undef: :replace)
    end

The problem is that the very first encoding in Encoding.list is ASCII-8BIT, which is valid for every byte stream. So if I use the code above and call s2 = guess_encoding(s), the bytes are treated as binary, and the conversion to UTF-8 replaces each of them, yielding three replacement characters instead of my double-arrow character.
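One way around this (my sketch, not part of the original post) is to hand guess_encoding an explicitly ordered list, strict encodings first, accept-anything encodings last:

    # The ordering is an assumption: UTF-8 is strict, Windows-1252 rejects a
    # few undefined bytes, while ISO-8859-1 and ASCII-8BIT accept any byte
    # at all, so those two go to the very end.
    def ordered_encodings
      strict     = [Encoding::UTF_8, Encoding::Windows_1252]
      permissive = [Encoding::ISO_8859_1, Encoding::ASCII_8BIT]
      strict + (Encoding.list - strict - permissive) + permissive
    end

    s2 = guess_encoding(s, 'utf-8', *ordered_encodings)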

Finally, question(s)

In what order should the encodings be checked to maximize the chance that the first one to pass valid_encoding? is the right one? Which encodings are strictest about which byte sequences they accept, so that I should try them first, and which are completely permissive, so that I should try them last?

Are there any other heuristics I should use to judge correctness? (For example, is a valid encoding more likely to be correct if it produces fewer characters than another?)
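The fewer-characters heuristic is easy to prototype; here is a sketch under my own assumptions (rank_by_char_count is a hypothetical helper, not an existing API):

    # Among encodings that consider the bytes valid, prefer those that decode
    # to the fewest characters, i.e. that recognized multi-byte sequences.
    def rank_by_char_count(bytes, encodings)
      encodings.select { |e| bytes.dup.force_encoding(e).valid_encoding? }
               .sort_by { |e| bytes.dup.force_encoding(e).length }
    end

    rank_by_char_count("\xE2\x87\x88".b, [Encoding::UTF_8, Encoding::ISO_8859_1])
    #=> UTF-8 ranks first: it reads the three bytes as the single character "⇈"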

3 answers

You could try the rchardet19 gem. It "takes a sequence of bytes in an unknown character encoding and attempts to determine the encoding", and it also gives you a confidence score for the encoding it returns. It has worked for me several times in the past and seems to do what you are trying to accomplish.

Usage example:

    require 'rchardet19'
    cd = CharDet.detect("some data")
    # => #<struct #<Class:0x102216198> encoding="ascii", confidence=1.0>
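If detection succeeds, the reported name can usually be fed back into String#encode. A sketch (raw_line is a placeholder; detector names don't always map onto Ruby encoding names, so guard for that in real code):

    cd = CharDet.detect(raw_line)
    utf8 = raw_line.force_encoding(cd.encoding)
                   .encode('UTF-8', invalid: :replace, undef: :replace)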

If your code can run on a Unix/Linux machine, filemagic may work well for you.

 gem install ruby-filemagic 

This is most useful as a tool to determine the encoding of the entire file, which can then be used for all lines in the file. The following should help you get started with it:

    $ irb
    irb(main):001:0> require 'filemagic'
    => true
    irb(main):002:0> fm = FileMagic.new
    => #<FileMagic:0x7fd4afb0>
    irb(main):003:0> fm.file('afile.zip')
    => "Zip archive data, at least v2.0 to extract"
    irb(main):004:0>
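For the charset specifically, libmagic can be asked for MIME output rather than a description. A sketch using the gem's flag constant (the exact string returned depends on your libmagic version):

    require 'filemagic'
    fm = FileMagic.new(FileMagic::MAGIC_MIME)
    fm.file('arrows.txt') #=> e.g. "text/plain; charset=utf-8"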

When I was writing spiders, I always tried ISO-8859-1 first and then Windows-1252. The difference between them is small, so either should be suitable in most cases. My reasoning for trying those two first: they are the encodings you are most likely to come across.

If something didn't match either of those, I would just use iconv to convert it to UTF-8, or strip the diacritical marks so the text was visually close to what we expected to see, and carry on.
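That fallback is easy to express in today's Ruby; a sketch in my wording, not the answerer's code:

    # Keep the line if it is valid UTF-8; otherwise reinterpret it as
    # ISO-8859-1 (swap in Windows-1252 here if your traffic leans that way).
    # ISO-8859-1 defines all 256 byte values, so the last step cannot fail.
    def to_utf8(line)
      utf8 = line.dup.force_encoding(Encoding::UTF_8)
      return utf8 if utf8.valid_encoding?
      line.dup.force_encoding(Encoding::ISO_8859_1).encode('UTF-8')
    end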

There were times when nothing was a hit. I had code that retrieved all the iconv encodings, discarded all the ASCII values, and tried to find the encoding with the most hits among the remaining characters. XML and HTML were sometimes so mangled that nothing helped, and that was when I gave up and punted.
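A rough modern equivalent of that brute-force pass, using Encoding.list where the original used iconv (my reconstruction, not the answerer's code):

    # Score each encoding by how many non-ASCII characters it both accepts
    # and can convert to UTF-8; dummy encodings (e.g. bare UTF-16) are skipped.
    def most_hits_encoding(line)
      Encoding.list.reject(&:dummy?).max_by do |enc|
        candidate = line.dup.force_encoding(enc)
        next -1 unless candidate.valid_encoding?
        begin
          candidate.encode('UTF-8').chars.count { |c| c.ord > 127 }
        rescue Encoding::UndefinedConversionError, Encoding::ConverterNotFoundError
          -1
        end
      end
    end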


Source: https://habr.com/ru/post/1396258/

