Unicode escape syntax in Java

Question

In Java, I learned that to write Unicode characters that are not on the keyboard (for example, non-ASCII characters), you can use the following syntax:

(\u)(u)*(HexDigit)(HexDigit)(HexDigit)(HexDigit)

My question is: what is the purpose of (u)* in the syntax above?

One use case I know of is representing the yen symbol in Java:

 char ch = '\u00A5';

+9

java unicode

user3265048 Feb 03 '14 at 8:36


2 answers

This means that you can add as many u as you want - for example, these lines are equivalent:

 char ch = '\u00A5';
 char ch = '\uuuuu00A5';
 char ch = '\uuuuuuuuuuuuuuuuuu00A5';

(and they all compile)
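To check the equivalence rather than take it on faith, here is a minimal sketch. The class and field names (`ExtraUDemo`, `ONE_U`, and so on) are made up for illustration; only the three literals come from the answer.

```java
class ExtraUDemo {
    // Each literal uses a different number of 'u' markers before the
    // same four hex digits (00A5, the yen sign).
    static final char ONE_U = '\u00A5';
    static final char FIVE_U = '\uuuuu00A5';
    static final char MANY_U = '\uuuuuuuuuuuuuuuuuu00A5';

    // True when all three literals denote the same character.
    static boolean allEqual() {
        return ONE_U == FIVE_U && FIVE_U == MANY_U;
    }

    public static void main(String[] args) {
        System.out.println(allEqual());            // prints "true"
        System.out.println((int) ONE_U == 0x00A5); // prints "true"
    }
}
```

The escapes are translated before the compiler tokenizes the source, so the number of u's leaves no trace in the compiled class.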

-1

assylias Feb 03 '14 at 8:48


Aaron Digulla · Accepted Answer · 2014-02-03 08:54

Interesting question. Section 3.3 of the JLS says:

 UnicodeEscape:
     \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

 UnicodeMarker:
     u
     UnicodeMarker u

which translates to \\u+\p{XDigit}{4}

and

If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.
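As a rough illustration, the regular expression above can be tried out with `java.util.regex`. This is only a sketch: the class and method names are hypothetical, and the pattern checks the shape of a single escape. It does not model the JLS rule about which backslashes are "eligible" (those preceded by an even number of other backslashes).

```java
import java.util.regex.Pattern;

class EscapeRegexDemo {
    // The answer's pattern as a Java string literal: a backslash,
    // one or more 'u', then exactly four hexadecimal digits.
    static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u+\\p{XDigit}{4}");

    // True if the whole text has the shape of one Unicode escape.
    static boolean isUnicodeEscape(String text) {
        return UNICODE_ESCAPE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeEscape("\\u00A5"));    // true
        System.out.println(isUnicodeEscape("\\uuuu00A5")); // true
        System.out.println(isUnicodeEscape("\\00A5"));     // false: no u
        System.out.println(isUnicodeEscape("\\u00A"));     // false: three hex digits
    }
}
```

The last case is the shape that the quoted rule turns into a compile-time error when it appears in real source code.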

So you are right: there can be one or more u after the backslash. The reason is given further down:

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

So this input

  \u0020ä

becomes

  \uu0020\u00e4

Here the uu says: "this was a Unicode escape sequence to begin with," and the single u in the second escape says: "an automatic tool converted a non-ASCII character into a Unicode escape."

This information is useful when you want to convert back from ASCII to Unicode: you can restore as much of the original source code as possible.
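The round trip described in the JLS can be sketched in a few lines. This is not the code of any real conversion tool; `AsciiRoundTrip`, `toAscii`, `fromAscii` and `startsEscape` are names invented for this example. It assumes every escape in the input is well formed (four hex digits after the last u) and ignores corner cases such as a non-ASCII character directly after a lone backslash.

```java
class AsciiRoundTrip {

    // True if index i holds a backslash that is followed by 'u' and is
    // preceded by an even number of contiguous backslashes ("eligible").
    static boolean startsEscape(String s, int i) {
        if (s.charAt(i) != '\\' || i + 1 >= s.length() || s.charAt(i + 1) != 'u') {
            return false;
        }
        int preceding = 0;
        for (int j = i - 1; j >= 0 && s.charAt(j) == '\\'; j--) {
            preceding++;
        }
        return preceding % 2 == 0;
    }

    // Unicode source to ASCII: existing escapes gain one extra u,
    // non-ASCII characters become escapes with a single u.
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (startsEscape(source, i)) {
                // Emit the backslash plus one additional u; the original
                // u's and hex digits are copied by later iterations.
                out.append("\\u");
            } else if (c > 127) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // ASCII back to Unicode source: escapes with several u's lose one u,
    // escapes with a single u are decoded to the character itself.
    static String fromAscii(String ascii) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < ascii.length()) {
            if (!startsEscape(ascii, i)) {
                out.append(ascii.charAt(i++));
                continue;
            }
            int end = i + 1;                 // index just past the run of u's
            while (end < ascii.length() && ascii.charAt(end) == 'u') {
                end++;
            }
            int uCount = end - (i + 1);
            if (uCount > 1) {
                // Keep the escape but drop one u; the hex digits are
                // copied unchanged by the following iterations.
                out.append('\\').append(ascii, i + 2, end);
                i = end;
            } else {
                // Assumes four hex digits follow (well-formed input).
                out.append((char) Integer.parseInt(ascii.substring(end, end + 4), 16));
                i = end + 4;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The answer's example: an escaped space followed by a real a-umlaut.
        String original = "\\u0020\u00e4";
        String ascii = toAscii(original);
        System.out.println(ascii.equals("\\uu0020\\u00e4"));    // prints "true"
        System.out.println(fromAscii(ascii).equals(original)); // prints "true"
    }
}
```

Without the extra u, the reverse step could not tell an escape the author typed from one the tool generated, and would decode both.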
