Implementations of the DataInput and DataOutput interfaces represent Unicode strings in a format that is a slight modification of UTF-8. (For information on the standard UTF-8 format, see Section 3.9, Unicode Encoding Forms, of The Unicode Standard, Version 4.0.) Note that in the following tables, the most significant bit appears in the leftmost column.
... (some tables, please click the javadoc link to see yourself) ...
The differences between this format and the standard UTF-8 format are as follows:
- The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so the encoded strings never contain embedded nulls.
- Only the 1-byte, 2-byte, and 3-byte formats are used.
- Supplementary characters are represented as surrogate pairs.
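The differences listed above can be observed by inspecting the bytes that DataOutputStream.writeUTF produces. The following is a minimal sketch; the class name ModifiedUtf8Differences and the helpers encode and hex are made up for illustration and are not part of the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of the JDK.
final class ModifiedUtf8Differences {
    /** Returns the bytes writeUTF emits for s, including the 2-byte length prefix. */
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new DataOutputStream(buffer).writeUTF(s);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The null character takes two bytes (C0 80), so no zero byte appears in the payload.
        System.out.println(hex(encode("\u0000")));        // 00 02 C0 80
        // U+1F600 is written as two 3-byte groups, one per surrogate, not as one 4-byte group.
        System.out.println(hex(encode("\uD83D\uDE00")));  // 00 06 ED A0 BD ED B8 80
    }
}
```

Standard UTF-8 would encode the same two strings as 00 and F0 9F 98 80 respectively, which is why the two formats are not interchangeable for these inputs.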
String readUTF() throws IOException
Reads a string that has been encoded using the modified UTF-8 format. The general contract of readUTF is that it reads a representation of a Unicode character string encoded in modified UTF-8 format; this string of characters is then returned as a String.
First, two bytes are read and used to construct an unsigned 16-bit integer in exactly the manner of the readUnsignedShort method. This integer value is called the UTF length and specifies the number of additional bytes to be read. These bytes are then converted to characters by considering them in groups. The length of each group is computed from the value of the first byte of the group. The byte following a group, if any, is the first byte of the next group.
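To illustrate that the UTF length counts encoded bytes rather than characters, here is a small sketch that writes a string with writeUTF and reads back only the length prefix. The class UtfLengthPrefix and its method utfLength are illustrative names, not JDK API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of the JDK.
final class UtfLengthPrefix {
    /** Writes s with writeUTF, then reads back only the 2-byte UTF length. */
    static int utfLength(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new DataOutputStream(buffer).writeUTF(s);
            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            return in.readUnsignedShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo";            // five characters, one of them outside ASCII
        System.out.println(s.length());     // 5
        System.out.println(utfLength(s));   // 6, because U+00E9 needs a 2-byte group
    }
}
```

Because the prefix is an unsigned 16-bit value, the encoded form of a single string cannot exceed 65535 bytes.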
If the first byte of a group matches the bit pattern 0xxxxxxx (where x means "may be 0 or 1"), then the group consists of just that byte. The byte is zero-extended to form a character.
If the first byte of a group matches the bit pattern 110xxxxx, then the group consists of that byte a and a second byte b. If there is no byte b (because byte a was the last of the bytes to be read), or if byte b does not match the bit pattern 10xxxxxx, then a UTFDataFormatException is thrown. Otherwise, the group is converted to the character:
(char)(((a & 0x1F) << 6) | (b & 0x3F))
If the first byte of a group matches the bit pattern 1110xxxx, then the group consists of that byte a and two more bytes b and c. If there is no byte c (because byte a was one of the last two of the bytes to be read), or if either byte b or byte c does not match the bit pattern 10xxxxxx, then a UTFDataFormatException is thrown. Otherwise, the group is converted to the character:
(char)(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F))
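As a worked example of the 2-byte and 3-byte formulas, the sketch below applies them to two concrete byte sequences. The class GroupDecoding and its methods are hypothetical helpers; they signal a bad bit pattern with IllegalArgumentException purely to keep the example free of checked exceptions, whereas readUTF itself throws UTFDataFormatException.

```java
// Hypothetical demo class, not part of the JDK.
final class GroupDecoding {
    /** Decodes a 2-byte group; a and b are unsigned byte values (0..255). */
    static char decodeTwo(int a, int b) {
        if ((a & 0xE0) != 0xC0 || (b & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a 2-byte group");
        }
        return (char) (((a & 0x1F) << 6) | (b & 0x3F));
    }

    /** Decodes a 3-byte group; a, b and c are unsigned byte values (0..255). */
    static char decodeThree(int a, int b, int c) {
        if ((a & 0xF0) != 0xE0 || (b & 0xC0) != 0x80 || (c & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a 3-byte group");
        }
        return (char) (((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F));
    }

    public static void main(String[] args) {
        // C3 A9: (0x03 << 6) | 0x29 = 0xE9
        System.out.println(Integer.toHexString(decodeTwo(0xC3, 0xA9)));          // e9
        // E2 82 AC: (0x2 << 12) | (0x02 << 6) | 0x2C = 0x20AC
        System.out.println(Integer.toHexString(decodeThree(0xE2, 0x82, 0xAC)));  // 20ac
        // C0 80 is the 2-byte form of the null character.
        System.out.println((int) decodeTwo(0xC0, 0x80));                         // 0
    }
}
```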
If the first byte of a group matches the pattern 1111xxxx or the pattern 10xxxxxx, then a UTFDataFormatException is thrown.
If end of file is encountered at any time during this entire process, then an EOFException is thrown.
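The failure cases can be reproduced by feeding hand-built byte arrays to DataInputStream.readUTF. This is a sketch only; the class ReadUtfFailures and its outcome helper are invented for the example, and the behavior shown is that of the standard DataInputStream implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UTFDataFormatException;

// Hypothetical demo class, not part of the JDK.
final class ReadUtfFailures {
    /** Returns the decoded string, or the simple name of the exception readUTF threw. */
    static String outcome(int... unsignedBytes) {
        byte[] data = new byte[unsignedBytes.length];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) unsignedBytes[i];
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (UTFDataFormatException e) {
            return "UTFDataFormatException";
        } catch (EOFException e) {
            return "EOFException";
        } catch (IOException e) {
            return "IOException";
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(0x00, 0x01, 0x41)); // A
        System.out.println(outcome(0x00, 0x01, 0x80)); // UTFDataFormatException (10xxxxxx as first byte)
        System.out.println(outcome(0x00, 0x01, 0xF0)); // UTFDataFormatException (1111xxxx as first byte)
        System.out.println(outcome(0x00, 0x01, 0xC3)); // UTFDataFormatException (byte b is missing)
        System.out.println(outcome(0x00, 0x02, 0xC3)); // EOFException (UTF length 2, only 1 byte present)
    }
}
```

The last two cases differ only in the UTF length: when the declared bytes are all present but a group is incomplete, the format is at fault; when the stream ends before the declared bytes have been read, end of file is reported.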
After every group has been converted to a character by this process, the characters are gathered, in the same order in which their corresponding groups were read from the input stream, to form a String, which is returned.
The writeUTF method of interface DataOutput may be used to write data that is suitable for reading by this method.
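Putting the steps together, the following sketch decodes a string by hand from data produced by writeUTF. It follows the contract described above but is not the JDK's own implementation; the class ManualReadUtf and the methods readModifiedUtf and roundTrip are hypothetical names chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of the JDK.
final class ManualReadUtf {
    /** Decodes one modified UTF-8 string from in, group by group. */
    static String readModifiedUtf(DataInputStream in) throws IOException {
        int utfLength = in.readUnsignedShort();
        byte[] bytes = new byte[utfLength];
        in.readFully(bytes);                       // throws EOFException if input ends early
        StringBuilder result = new StringBuilder(utfLength);
        int i = 0;
        while (i < utfLength) {
            int a = bytes[i] & 0xFF;
            if ((a & 0x80) == 0) {                 // 0xxxxxxx: single byte, zero-extended
                result.append((char) a);
                i += 1;
            } else if ((a & 0xE0) == 0xC0) {       // 110xxxxx: needs byte b
                if (i + 1 >= utfLength) {
                    throw new UTFDataFormatException("partial 2-byte group at end");
                }
                int b = bytes[i + 1] & 0xFF;
                if ((b & 0xC0) != 0x80) {
                    throw new UTFDataFormatException("bad continuation byte");
                }
                result.append((char) (((a & 0x1F) << 6) | (b & 0x3F)));
                i += 2;
            } else if ((a & 0xF0) == 0xE0) {       // 1110xxxx: needs bytes b and c
                if (i + 2 >= utfLength) {
                    throw new UTFDataFormatException("partial 3-byte group at end");
                }
                int b = bytes[i + 1] & 0xFF;
                int c = bytes[i + 2] & 0xFF;
                if ((b & 0xC0) != 0x80 || (c & 0xC0) != 0x80) {
                    throw new UTFDataFormatException("bad continuation byte");
                }
                result.append((char) (((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F)));
                i += 3;
            } else {                               // 10xxxxxx or 1111xxxx as first byte
                throw new UTFDataFormatException("invalid first byte of group");
            }
        }
        return result.toString();
    }

    /** Writes s with writeUTF and reads it back with the manual decoder. */
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new DataOutputStream(buffer).writeUTF(s);
            return readModifiedUtf(
                    new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "h\u00E9llo \u20AC \uD83D\uDE00";
        System.out.println(roundTrip(original).equals(original)); // true
    }
}
```

In ordinary code there is no reason to decode by hand; calling readUTF on a DataInputStream wrapped around the same bytes should return the same string.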