Default character encoding for Java console output

How does Java determine the encoding used for System.out?

Given the following class:

    import java.io.File;
    import java.io.PrintWriter;

    public class Foo {
        public static void main(String[] args) throws Exception {
            String s = "xxäñxx";
            System.out.println(s);
            PrintWriter out = new PrintWriter(new File("test.txt"), "UTF-8");
            out.println(s);
            out.close();
        }
    }

It is saved as UTF-8 and compiled with javac -encoding UTF-8 Foo.java on a Windows system.

Then, in the Git Bash console (set to UTF-8 encoding), I get:

    $ java Foo
    xxõ±xx
    $ java -Dfile.encoding=UTF-8 Foo
    xx├ñ├▒xx
    $ cat test.txt
    xxäñxx
    $ java Foo | cat
    xxäñxx
    $ java -Dfile.encoding=UTF-8 Foo | cat
    xxäñxx

What's going on here?

Obviously, Java checks whether it is connected to a terminal and changes its encoding in that case. Is there a way to get Java to simply output plain UTF-8?


I also tried the cmd console. Redirecting STDOUT does not seem to make any difference there. Without the file.encoding parameter the output is ANSI-encoded; with the parameter it is UTF-8-encoded.

1 answer

I assume that your console is still running under cmd.exe. I doubt your console really expects UTF-8; I expect it actually uses a DOS OEM encoding (e.g. 850 or 437).

Java will encode bytes using the default encoding set during JVM initialization.
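To see which default your own JVM picked up, a small check like the following can help. It is only a sketch: the class name ShowDefaultEncoding is made up for illustration, and the values it reports depend on the JDK version, the platform, and any -Dfile.encoding option passed on the command line.

```java
import java.nio.charset.Charset;

public class ShowDefaultEncoding {
    // Builds a one-line summary of the JVM's default charset settings.
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        // The output uses only ASCII, so it is readable whatever the console encoding is.
        System.out.println(describe());
    }
}
```

Running it once as java ShowDefaultEncoding and once as java -Dfile.encoding=UTF-8 ShowDefaultEncoding shows whether the option changes the default on your JVM.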

Reproducing this on my PC:

 java Foo 

Java encodes as windows-1252; the console decodes as IBM850. Result: mojibake.

 java -Dfile.encoding=UTF-8 Foo 

Java encodes as UTF-8; the console decodes as IBM850. Result: mojibake.

 cat test.txt 

cat decodes the file as UTF-8; cat encodes as IBM850; the console decodes as IBM850.

 java Foo | cat 

Java encodes as windows-1252; cat decodes as windows-1252; cat encodes as IBM850; the console decodes as IBM850.

 java -Dfile.encoding=UTF-8 Foo | cat 

Java encodes as UTF-8; cat decodes as UTF-8; cat encodes as IBM850; the console decodes as IBM850.
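The two mojibake cases (no cat involved) can be reproduced without a console by encoding the string with one charset and decoding the bytes with another. This is a sketch with made-up names (MojibakeDemo, misread, escape), and it assumes the JRE ships the IBM850 charset, which a full JDK normally does. The result is printed as \u escapes so the demo does not depend on the console encoding itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes text with the writer's charset, then decodes the bytes with the reader's.
    static String misread(String text, Charset writer, Charset reader) {
        return new String(text.getBytes(writer), reader);
    }

    // Renders non-ASCII characters as backslash-u escapes; ASCII passes through.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "xx\u00e4\u00f1xx";                 // the test string from the question
        Charset console = Charset.forName("IBM850");   // assumed console code page
        System.out.println(escape(misread(s, Charset.forName("windows-1252"), console)));
        System.out.println(escape(misread(s, StandardCharsets.UTF_8, console)));
    }
}
```

The first line corresponds to xxõ±xx and the second to xx├ñ├▒xx, matching the output in the question.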

This cat implementation must be using a heuristic to determine whether the character data is UTF-8 or not, and then transcoding the data from UTF-8 or ANSI (e.g. windows-1252) to the console encoding (e.g. IBM850).

This can be confirmed using the following commands:

    $ java HexDump utf8.txt
    78 78 c3 a4 c3 b1 78 78
    $ cat utf8.txt
    xxäñxx
    $ java HexDump ansi.txt
    78 78 e4 f1 78 78
    $ cat ansi.txt
    xxäñxx

The cat command can make this determination because e4 f1 is not a valid UTF-8 sequence.
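A check of this kind can be sketched in Java with a strict decoder. This is not how this cat is actually implemented (that is unknown); it only illustrates that the two byte sequences above can be told apart. The names Utf8Check and isValidUtf8 are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true only if the whole byte array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {0x78, 0x78, (byte) 0xc3, (byte) 0xa4, (byte) 0xc3, (byte) 0xb1, 0x78, 0x78};
        byte[] ansi = {0x78, 0x78, (byte) 0xe4, (byte) 0xf1, 0x78, 0x78};
        System.out.println(isValidUtf8(utf8)); // true
        System.out.println(isValidUtf8(ansi)); // false: e4 must be followed by continuation bytes
    }
}
```

Such a heuristic can misfire: pure ASCII is valid in both encodings, and some ANSI byte sequences happen to be well-formed UTF-8.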

You can fix Java output:
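One option (a sketch, not necessarily the only fix) is to stop relying on the JVM default and wrap standard output in a PrintStream with an explicit charset that matches what the console decodes: IBM850 in the setup described above, or UTF-8 on a terminal that really expects it. The class and method names here are illustrative.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitStdout {
    // Wraps a raw byte stream in an auto-flushing PrintStream that encodes with the named charset.
    static PrintStream withCharset(OutputStream raw, String charsetName) {
        try {
            return new PrintStream(raw, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Charset name from the command line; UTF-8 is just the fallback for this sketch.
        String charsetName = args.length > 0 ? args[0] : "UTF-8";
        PrintStream out = withCharset(new FileOutputStream(FileDescriptor.out), charsetName);
        out.println("xx\u00e4\u00f1xx");
    }
}
```

For example, java ExplicitStdout IBM850 writes bytes that an IBM850 console decodes correctly, assuming the JRE provides that charset.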

HexDump is a trivial Java application:

    import java.io.*;

    class HexDump {
        public static void main(String[] args) throws IOException {
            // Print every byte of the file named by the first argument as two hex digits.
            try (InputStream in = new FileInputStream(args[0])) {
                int r;
                while ((r = in.read()) != -1) {
                    System.out.format("%02x ", 0xFF & r);
                }
                System.out.println();
            }
        }
    }

Source: https://habr.com/ru/post/1246682/

