Unicode escape behavior in Java programs

Question

A few days ago I was asked about the output of this program:

 public static void main(String[] args) {
     // \u0022 is the Unicode escape for double quote (")
     System.out.println("a\u0022.length() + \u0022b".length());
 }

My first thought was that this program should print the length of a\u0022.length() + \u0022b, which is 16, but surprisingly it printed 2. I know that \u0022 is the Unicode escape for ", but I thought this " would be escaped and would represent a single " literal with no special meaning. In fact, Java parsed the line like this:

 System.out.println("a".length() + "b".length());

I can't get my head around this strange behavior. Why don't Unicode escapes behave like regular escape sequences?

Update: Apparently, this was one of the brainteasers from Java Puzzlers: Traps, Pitfalls, and Corner Cases, a book written by Joshua Bloch and Neal Gafter. More specifically, the question was related to Puzzle 14: Escape Rout.

+5

java

Ali Dehghani Mar 09 '16 at 19:48

3 answers

Before the compiler translates the source code into bytecode, the lexical translation stage converts the statement

 System.out.println("a\u0022.length() + \u0022b".length());

into

 System.out.println("a".length() + "b".length());

Therefore, the result is 2.
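To see the two readings side by side, here is a minimal sketch. The class and method names are made up for illustration and are not from the original question.

```java
public class EscapeLengths {
    // Unicode escapes: each one becomes a real double quote before tokenizing,
    // so the compiler sees two one-character literals, "a" and "b".
    static int withUnicodeEscapes() {
        return "a\u0022.length() + \u0022b".length();
    }

    // String escapes: each backslash-quote pair stays inside one literal.
    static int withStringEscapes() {
        return "a\".length() + \"b".length();
    }

    public static void main(String[] args) {
        System.out.println(withUnicodeEscapes()); // prints 2
        System.out.println(withStringEscapes());  // prints 16
    }
}
```

The second method is what the question expected the first one to mean.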

Also see this section on lexical translation from the Java Language Specification:

The raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
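That first step can be modeled roughly as a plain string transformation. The sketch below is only an approximation written for this thread, not how javac is implemented. The name `UnicodeEscapeTranslator` is hypothetical, and it assumes the rules from JLS §3.3: the backslash must follow an even number of contiguous backslashes, and any number of `u` characters may follow it.

```java
public class UnicodeEscapeTranslator {
    // Simplified model of the first lexical translation step.
    static String translate(String source) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // contiguous backslashes seen just before index i
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            boolean startsEscape = c == '\\' && backslashes % 2 == 0
                    && i + 1 < source.length() && source.charAt(i + 1) == 'u';
            if (!startsEscape) {
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                out.append(c);
                i++;
                continue;
            }
            int j = i + 1;
            while (j < source.length() && source.charAt(j) == 'u') {
                j++; // skip every 'u' marker
            }
            if (j + 4 > source.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            int codeUnit = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = Character.digit(source.charAt(k), 16);
                if (digit < 0) {
                    throw new IllegalArgumentException("bad hex digit at index " + k);
                }
                codeUnit = codeUnit * 16 + digit;
            }
            out.append((char) codeUnit);
            backslashes = 0;
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes keep these escapes as plain text in the string.
        String raw = "\"a\\u0022.length() + \\u0022b\".length()";
        System.out.println(raw);
        System.out.println(translate(raw));
    }
}
```

Running `main` prints the expression from the question first as written and then as the tokenizer would receive it, with real double quotes in place of the escapes.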

+7

manouti Mar 09 '16 at 19:51

Funnily enough, the following works (taken from the link)

 System.out.println("a\".length() + \"b".length());

but the following leads to a compilation error

 System.out.println("a\\\u0022.length() + \\\u0022b".length());

In the second case, one might expect the compiler to reduce \\ to \ and \u0022 to ", then combine them into \", but it does not do that, and the code does not compile (the " still closes the string).
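As far as I can tell from JLS §3.3, the reason is that a backslash only starts a Unicode escape when it is preceded by an even number of contiguous backslashes. In `\\\u0022` the third backslash qualifies, so the escape is turned into `"` first, which leaves `\\"`: an escaped backslash followed by a quote that closes the string. With only two backslashes, `\\u0022` is not a Unicode escape at all and the code compiles. A small sketch of that second case (the class name is made up):

```java
public class BackslashParity {
    // Two backslashes: the one before 'u' follows an odd number of
    // backslashes, so no Unicode escape starts here. The literal holds
    // a, backslash, u, 0, 0, 2, 2, b.
    static int doubledBackslash() {
        return "a\\u0022b".length();
    }

    public static void main(String[] args) {
        System.out.println(doubledBackslash()); // prints 8
        System.out.println("a\\u0022b");
    }
}
```

The second `println` shows the escape text itself, because the string escape for a backslash is processed later, when the literal is tokenized.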

0

Sci prog Mar 13 '16 at 21:49

Jon Skeet · Accepted Answer · 2016-03-09T19:52:15+0000

Why don't Unicode escape sequences behave like regular escape sequences?

Basically, they are processed elsewhere when reading the input: in lexing rather than in parsing, if I have my terminology right. They are not escape sequences in character literals or string literals; they are escape sequences for the whole source file. Any character that is not part of a Unicode escape sequence can be replaced with a Unicode escape sequence. So you can write programs entirely in ASCII that actually have non-ASCII variable, method, and class names...

Basically, I believe this was a design mistake in Java, because it can cause some very strange effects (for example, if you have the escape sequence for a line break in a // ... comment), but it is what it is...
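Both effects are easy to reproduce. The following is just an illustrative sketch with made-up names: the first method hides a statement behind a line feed escape inside a comment, and the second declares a non-ASCII identifier using only ASCII source text.

```java
public class CommentTrap {
    static int counter() {
        int n = 0;
        // The escape at the end of this comment is a line feed: \u000a n++;
        return n;
    }

    // The escape in the identifier is e-acute, so the variable has a
    // non-ASCII name even though this file contains only ASCII.
    static int accentedIdentifier() {
        int caf\u00e9 = 41;
        return caf\u00e9 + 1;
    }

    public static void main(String[] args) {
        System.out.println(counter());            // prints 1
        System.out.println(accentedIdentifier()); // prints 42
    }
}
```

In `counter`, the escape ends the comment early, so the `n++;` that looks commented out is compiled as a real statement.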

This is described in detail in section 3.3 of the JLS:

A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
...
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
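The round trip described in that last part can be sketched in a few lines. This is only an illustration of the rule as quoted, under the assumption that every escape in the input is well formed. `AsciiTransform`, `toAscii`, `fromAscii`, and `startsEscape` are hypothetical names, not part of any JDK API.

```java
public class AsciiTransform {
    // True when the backslash at index i begins a Unicode escape: it must follow
    // an even number of contiguous backslashes and be followed by 'u'.
    static boolean startsEscape(String s, int i) {
        if (s.charAt(i) != '\\' || i + 1 >= s.length() || s.charAt(i + 1) != 'u') {
            return false;
        }
        int preceding = 0;
        for (int k = i - 1; k >= 0 && s.charAt(k) == '\\'; k--) {
            preceding++;
        }
        return preceding % 2 == 0;
    }

    // Unicode source to ASCII: existing escapes gain a 'u', non-ASCII
    // characters become escapes with a single 'u'.
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (startsEscape(source, i)) {
                out.append("\\u"); // the original 'u' is copied on the next iteration
            } else if (c > 127) {
                out.append("\\u").append(String.format("%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // ASCII form back to Unicode: escapes with several 'u's lose one,
    // escapes with a single 'u' become the character itself.
    // Assumes well-formed escapes (four hex digits after the last 'u').
    static String fromAscii(String ascii) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < ascii.length()) {
            if (!startsEscape(ascii, i)) {
                out.append(ascii.charAt(i++));
            } else if (i + 2 < ascii.length() && ascii.charAt(i + 2) == 'u') {
                out.append('\\');
                i += 2; // drop one 'u'; the remaining ones are copied as-is
            } else {
                out.append((char) Integer.parseInt(ascii.substring(i + 2, i + 6), 16));
                i += 6;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // One real non-ASCII character (written with an escape so this file
        // stays ASCII) followed by one escape kept as plain text.
        String original = "caf\u00e9 = \\u0041";
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals(original)); // prints true
    }
}
```

The extra `u` is what lets the reverse step tell apart escapes that were already in the source from escapes that stand in for a non-ASCII character.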
