I am doing some work on a common crawl (a big crawl of the Internet), and I keep seeing a strange encoding that I just can't figure out how to deal with.
The pattern that I see over and over looks like the byte sequence 50 6f 6b e9 6d 6f 6e, which I assume is intended to represent Pokémon.
Now, encodings are not my strong suit, but I do not know of any encoding in which it is valid to represent é as just e9.
It is a bit like UTF-16 [1], which would be fe ff 00 50 00 6f 00 6b 00 e9 00 6d 00 6f 00 6e.
And it is definitely not UTF-8, which would be 50 6f 6b c3 a9 6d 6f 6e.
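As a sanity check on the byte sequences quoted above, here is a minimal Java sketch that prints how the JDK encodes the word under a few charsets. The class name `EncodingBytes` and the `hex` helper are hypothetical, made up for this illustration. ISO-8859-1 is included only as a candidate: it is a single-byte charset that maps é to e9, but that alone does not prove the crawled pages were written in it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is for illustration only.
final class EncodingBytes {

    // Render a byte array as lowercase, space-separated hex, e.g. "50 6f 6b".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "Pok\u00e9mon"; // \u00e9 is the letter é

        Charset[] charsets = {
            StandardCharsets.UTF_16,     // the JDK encoder writes a big-endian BOM (fe ff) first
            StandardCharsets.UTF_8,      // é becomes the two bytes c3 a9
            StandardCharsets.ISO_8859_1  // candidate: é becomes the single byte e9
        };

        // Print one line per charset so the byte patterns can be compared by eye.
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": " + hex(word.getBytes(cs)));
        }
    }
}
```

Only the ISO-8859-1 line reproduces the seven-byte pattern seen in the crawl. windows-1252 maps é to the same byte, so this check cannot tell those two apart.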
So I am just after something to decode these bytes into a String in Java; a library would be ideal.
new String(bytes) understandably does not work: it just converts the e9 byte to the replacement character ef bf bd (just like the scary one).
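One stdlib-only way to approach this is to decode strictly as UTF-8 and fall back to a single-byte charset when the input is malformed. The sketch below assumes ISO-8859-1 is an acceptable fallback, which is a guess and not something the crawl data confirms. The class name `FallbackDecoder` and its `decode` method are hypothetical. This is a crude heuristic, not a real charset detector.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name and the fallback choice are assumptions.
final class FallbackDecoder {

    // Try strict UTF-8 first; if the bytes are not valid UTF-8,
    // reinterpret them as ISO-8859-1 (every byte maps to a character there).
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] crawled = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};
        byte[] utf8 = {0x50, 0x6f, 0x6b, (byte) 0xc3, (byte) 0xa9, 0x6d, 0x6f, 0x6e};

        // The lenient String constructor silently substitutes U+FFFD for the bad byte.
        String lenient = new String(crawled, StandardCharsets.UTF_8);
        System.out.println(lenient.indexOf('\ufffd') >= 0);

        // Both inputs come out as the same word with the strict-then-fallback approach.
        System.out.println(decode(crawled).equals(decode(utf8)));
    }
}
```

The strict decoder is what makes the fallback possible: `new String(bytes, charset)` never reports malformed input, so there is nothing to react to. Text that happens to be valid UTF-8 but was written in another charset will still be misread, which is where a statistical detector earns its keep.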
Any ideas on how to deal with them?
Update
In the end, I used the charset detector that comes with Apache Tika [2]. It works well.
[1] http://www.fileformat.info/info/unicode/char/e9/index.htm
[2] http://tika.apache.org/0.8/api/org/apache/tika/parser/txt/CharsetDetector.html