Which encoding should I use to find code that relies on the default encoding?

A common mistake when writing code that reads text from a stream in Java is forgetting to specify the encoding. If you do not specify one, Java uses the platform default encoding, which eventually causes problems ("But it works on my computer!").
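For illustration, this is the kind of code I mean (input.txt is just a placeholder file name):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class EncodingExample {
    public static void main(String[] args) throws IOException {
        // Relies on the platform default encoding -- "works on my computer",
        // breaks on a machine whose default is not UTF-8:
        try (BufferedReader bad = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt")))) {
            System.out.println(bad.readLine());
        }

        // Explicit charset -- behaves identically on every platform:
        try (BufferedReader good = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"),
                                      StandardCharsets.UTF_8))) {
            System.out.println(good.readLine());
        }
    }
}
```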

To find these problems, I want to use an unusual default encoding, one that breaks as many I/O operations as possible. The idea is that at the very least any character outside of ASCII will be garbled.
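As far as I understand (this is an assumption about JVM behaviour, not something the question relies on), on JVMs before JDK 18 the default charset can be overridden for testing with the file.encoding system property, and the effective default can be inspected from code; since JEP 400 the default is pinned to UTF-8 and file.encoding only accepts UTF-8 or COMPAT:

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {
    public static void main(String[] args) {
        // Launch with e.g.:  java -Dfile.encoding=UTF-16 ShowDefaultCharset
        // (pre-JDK 18; newer JVMs pin the default to UTF-8 per JEP 400)
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```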

Most of our documents are UTF-8. ISO-8859-1 could work, since it simply preserves the input (a 1:1 mapping between bytes and characters): any umlauts in UTF-8 text would show up as two- or three-character garbage sequences. But I wonder whether we can do better.

Which encoding, from the list of supported encodings, would you propose to use?

3 answers

I think any of the 16- or 32-bit UTFs will give you a lot of "null" characters, which should break plenty of string handling. In addition, using one with a BOM (byte order mark) should mangle the file further.
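For example (a minimal sketch of the effect):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16NullBytes {
    public static void main(String[] args) {
        // Plain ASCII text written with UTF-16 as the (default) encoding:
        byte[] bytes = "hello".getBytes(StandardCharsets.UTF_16);
        // Prints [-2, -1, 0, 104, 0, 101, 0, 108, 0, 108, 0, 111]:
        // a BOM (0xFE 0xFF) followed by a zero byte before every letter,
        // which reliably breaks consumers expecting ASCII or UTF-8 bytes.
        System.out.println(Arrays.toString(bytes));
    }
}
```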

That said, I would expect there to be code analysis tools that can flag the creation of strings, readers and writers without an explicit encoding.

Edit: FindBugs seems to be able to do this: Dm: Reliance on default encoding (DM_DEFAULT_ENCODING).
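For example (my own sketch, not an exhaustive list of what the detector reports), these are typical default-encoding calls next to their explicit-charset counterparts:

```java
import java.nio.charset.StandardCharsets;

public class DefaultEncodingSuspects {
    public static void main(String[] args) {
        byte[] bytes = {72, 105};

        // Suspect: these calls silently use the platform default charset.
        String implicit = new String(bytes);        // new String(byte[])
        byte[] implicitBytes = "Hi".getBytes();     // String.getBytes()
        // Other typical offenders (not constructed here): new FileReader(...),
        // new FileWriter(...), new InputStreamReader(InputStream),
        // new OutputStreamWriter(OutputStream), ...

        // Fine: the explicit-charset overloads.
        String explicit = new String(bytes, StandardCharsets.UTF_8);
        byte[] explicitBytes = "Hi".getBytes(StandardCharsets.UTF_8);

        System.out.println(implicit + " / " + explicit);
        System.out.println(implicitBytes.length + " / " + explicitBytes.length);
    }
}
```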


UTF-16 as the default encoding has a good chance of mangling any document that is not UTF-16.

But I think you are going about this the wrong way. The best way to detect dodgy code that relies on default encodings is to write some custom rules for something like PMD: just look for code that uses the relevant String methods and constructors, the I/O classes, and so on.

(The problem with the "use a strange default encoding" approach is that your testing may not exercise all of the offending code, or it may exercise the code without you noticing the mangled output.)


java.nio.charset.Charset has a newDecoder() method that returns a CharsetDecoder . CharsetDecoder has isAutoDetecting() , isCharsetDetected() and detectedCharset() methods that look useful for your task. Unfortunately, all of these are optional operations.

I think you should take all the available charsets ( Charset.availableCharsets() ) and first check which of them are auto-detecting. Then, when you receive a new stream, first try the built-in auto-detection mechanism of those encodings that implement these optional operations.
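A quick sketch of that first step (on a stock JDK the resulting list is typically very short, often just x-JISAutoDetect):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class AutoDetectingCharsets {
    public static void main(String[] args) {
        // List the installed charsets whose decoders claim to auto-detect
        // the actual encoding of the input (an optional operation).
        for (Charset cs : Charset.availableCharsets().values()) {
            CharsetDecoder decoder = cs.newDecoder();
            if (decoder.isAutoDetecting()) {
                System.out.println(cs.name());
            }
        }
    }
}
```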

If none of those decoders can detect the charset, you should try to decode the stream (as you described), applying the other encodings one by one. To optimize the process, sort the encodings by the following criteria.

National alphabets first. For example, try Cyrillic encodings before the Latin-based ones.

Among the national alphabets, prefer the ones with more characters. For example, Japanese and Chinese would go to the front of the list.

The reason for this strategy is that you want wrong candidates to fail as quickly as possible. If your text does not contain Japanese characters, you usually only need to look at the first few characters of the stream to conclude that it is not Japanese. But if you try to decode French text as ASCII, you will probably have to read quite a few characters before you hit the first è .
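A minimal sketch of such a fail-fast check, assuming the candidate bytes are already in memory (the helper name decodesAs is mine):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    /**
     * Returns true if the bytes decode cleanly in the given charset.
     * With REPORT, decoding aborts at the first malformed or unmappable
     * sequence, so wrong candidates are rejected as early as possible.
     */
    static boolean decodesAs(byte[] data, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8French = "déjà vu".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodesAs(utf8French, StandardCharsets.UTF_8));   // true
        System.out.println(decodesAs(utf8French, StandardCharsets.US_ASCII)); // false
    }
}
```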


Source: https://habr.com/ru/post/1387537/
