Dealing with incorrectly encoded UTF-16 (?) in Java

I am doing some work on a large crawl of the Internet, and I keep seeing a strange encoding scheme that I just can't figure out how to deal with.

The pattern that I keep seeing looks like the byte sequence 50 6f 6b e9 6d 6f 6e, which I assume is intended to represent Pokémon.

Encoding schemes are not my strong point, but I do not know of any encoding in which it is valid to represent é as just e9.

It is a bit like [UTF-16][1], which would be fe ff 00 50 00 6f 00 6b 00 e9 00 6d 00 6f 00 6e

And it is definitely not UTF-8, which would be 50 6f 6b c3 a9 6d 6f 6e
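The two byte sequences above can be checked with a short sketch like the following. The class name EncodingHexSketch and the hex helper are made up for illustration; the only library calls are String.getBytes and the StandardCharsets constants. Java's UTF-16 charset writes a big-endian byte order mark (fe ff) when encoding, which is where the leading two bytes come from.

```java
import java.nio.charset.StandardCharsets;

final class EncodingHexSketch {
    // Hypothetical helper: formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00e9 is é; the escape avoids depending on the source file encoding.
        String word = "Pok\u00e9mon";
        System.out.println("UTF-16: " + hex(word.getBytes(StandardCharsets.UTF_16)));
        System.out.println("UTF-8:  " + hex(word.getBytes(StandardCharsets.UTF_8)));
    }
}
```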

So I am just after a way to decode these bytes into a String in Java; a library would be ideal.

new String(bytes) understandably does not work: it converts the e9 byte to the replacement character ef bf bd (the scary-looking U+FFFD).
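A minimal sketch of that behavior, assuming the default charset is UTF-8 (which the ef bf bd output suggests, since new String(bytes) decodes with the platform default). The charset is passed explicitly here so the result does not depend on the machine; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementCharSketch {
    // The byte sequence from the question: 50 6f 6b e9 6d 6f 6e
    static final byte[] SAMPLE = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};

    // Decodes as UTF-8; malformed input such as a lone e9 becomes U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeAsUtf8(SAMPLE);
        // Print each UTF-8 byte of the decoded string to show ef bf bd in place of e9.
        for (byte b : decoded.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
```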

Any ideas on how to deal with them?

Update

In the end, I used the character set detector provided in Apache Tika [2]. It works well.

[1] http://www.fileformat.info/info/unicode/char/e9/index.htm

[2] http://tika.apache.org/0.8/api/org/apache/tika/parser/txt/CharsetDetector.html

1 answer

This is either ISO-8859-1 or Windows-1252, the latter being essentially a superset of the former. In both, the single byte e9 is é. Use new String(bytes, "ISO-8859-1") or new String(bytes, "Windows-1252").
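As a rough sketch of that suggestion applied to the bytes from the question (class and method names are made up for illustration): StandardCharsets.ISO_8859_1 is guaranteed on every Java platform, while "windows-1252" is looked up by name and is assumed to be available, as it is on mainstream JDKs. The two charsets differ only in the range 80 to 9f, where Windows-1252 has printable characters such as the euro sign and ISO-8859-1 has control characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin1DecodeSketch {
    // The byte sequence from the question: 50 6f 6b e9 6d 6f 6e
    static final byte[] SAMPLE = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    static String decodeWindows1252(byte[] bytes) {
        // Assumption: the running JDK ships the windows-1252 charset.
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Both decode e9 to é (U+00E9).
        System.out.println(decodeLatin1(SAMPLE));
        System.out.println(decodeWindows1252(SAMPLE));

        // Byte 80 shows the difference: U+0080 (a control character) vs. U+20AC (euro sign).
        byte[] euro = {(byte) 0x80};
        System.out.println((int) decodeLatin1(euro).charAt(0));
        System.out.println((int) decodeWindows1252(euro).charAt(0));
    }
}
```

If the crawl data comes from Windows-authored pages, Windows-1252 is usually the safer guess of the two, since it decodes the 80 to 9f range to printable characters.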


Source: https://habr.com/ru/post/1383253/

