What does it mean to say "Modified UTF-8 encoding"?

What does it mean to say "Modified UTF-8 encoding"? How does this differ from conventional UTF-8 encoding?

+6
3 answers

This is described in detail in the Javadoc for DataInput:

Modified UTF-8

Implementations of the DataInput and DataOutput interfaces represent Unicode strings in a format that is a slight modification of UTF-8. (For information regarding the standard UTF-8 format, see section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 4.0.) Note that in the following tables, the most significant bit appears in the far left-hand column.

... (some tables; please follow the Javadoc link to see them for yourself) ...

The differences between this format and the standard UTF-8 format are as follows:

  • The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls.
  • Only the 1-byte, 2-byte, and 3-byte formats are used.
  • Supplementary characters are represented in the form of surrogate pairs.
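To make these differences concrete, here is a minimal sketch that encodes the same strings both ways and prints the bytes. The class name ModifiedUtf8Diff and its helper methods are made up for illustration; only DataOutputStream.writeUTF and String.getBytes are real JDK calls. The 2-byte length prefix that writeUTF emits is stripped so that only the character bytes are compared.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Diff {
    // Hypothetical helper: encode with writeUTF, then drop the 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Standard UTF-8, as produced by the regular charset machinery.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical helper: format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        // U+1F600 lies outside the BMP, so it is a surrogate pair in a Java String.
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(hex(standardUtf8(nul)));            // 00
        System.out.println(hex(modifiedUtf8(nul)));            // C0 80
        System.out.println(hex(standardUtf8(supplementary)));  // F0 9F 98 80
        System.out.println(hex(modifiedUtf8(supplementary)));  // ED A0 BD ED B8 80
    }
}
```

The last line shows the third bullet in action: each half of the surrogate pair is written as its own 3-byte group, so the supplementary character takes six bytes instead of four.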

How to read it is described in detail in the Javadoc for DataInput#readUTF():

readUTF

 String readUTF() throws IOException 

Reads in a string that has been encoded using a modified UTF-8 format. The general contract of readUTF is that it reads a representation of a Unicode character string encoded in modified UTF-8 format; this string of characters is then returned as a String.

First, two bytes are read and used to construct an unsigned 16-bit integer in exactly the manner of the readUnsignedShort method. This integer value is called the UTF length and specifies the number of additional bytes to be read. These bytes are then converted to characters by considering them in groups. The length of each group is computed from the value of the first byte of the group. The byte following a group, if any, is the first byte of the next group.
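As a small illustration of the length prefix, the sketch below writes a string with writeUTF and rebuilds the UTF length from the first two bytes using the same big-endian arithmetic as readUnsignedShort. The class UtfLengthPrefix and its methods are hypothetical names used only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class UtfLengthPrefix {
    // Hypothetical helper: returns the full writeUTF output, prefix included.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Same arithmetic as readUnsignedShort: high byte first, both bytes unsigned.
    static int utfLength(byte[] encoded) {
        return ((encoded[0] & 0xFF) << 8) | (encoded[1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] encoded = writeUtf("h\u00E9llo");   // 'é' needs two bytes
        System.out.println(utfLength(encoded));    // 6
        System.out.println(encoded.length - 2);    // 6
    }
}
```

Note that the UTF length counts encoded bytes, not characters, and because it is a 16-bit value a single writeUTF call cannot carry more than 65535 encoded bytes.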

If the first byte of a group matches the bit pattern 0xxxxxxx (where x means "may be 0 or 1"), then the group consists of just that byte. The byte is zero-extended to form a character.

If the first byte of a group matches the bit pattern 110xxxxx, then the group consists of that byte a and a second byte b. If there is no byte b (because byte a was the last of the bytes to be read), or if byte b does not match the bit pattern 10xxxxxx, then a UTFDataFormatException is thrown. Otherwise, the group is converted to the character:

 (char)(((a & 0x1F) << 6) | (b & 0x3F))

If the first byte of a group matches the bit pattern 1110xxxx, then the group consists of that byte a and two more bytes b and c. If there is no byte c (because byte a was one of the last two of the bytes to be read), or if either byte b or byte c does not match the bit pattern 10xxxxxx, then a UTFDataFormatException is thrown. Otherwise, the group is converted to the character:

 (char)(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F)) 

If the first byte of a group matches the pattern 1111xxxx or the pattern 10xxxxxx, then a UTFDataFormatException is thrown.

If end of file is encountered at any time during this entire process, then an EOFException is thrown.

After every group has been converted to a character by this process, the characters are gathered, in the same order in which their corresponding groups were read from the input stream, to form a String, which is returned.
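The group-by-group process described above can be sketched roughly as follows. This is not the JDK's implementation, just a plain reading of the contract: the class ModifiedUtf8Decoder and its decode method are hypothetical, and the input is assumed to be the bytes that follow the 2-byte length prefix, already held in an array (so the EOFException case does not arise here).

```java
import java.io.UTFDataFormatException;

class ModifiedUtf8Decoder {
    // Hypothetical helper: decodes the bytes that follow the 2-byte UTF length.
    static String decode(byte[] in) throws UTFDataFormatException {
        StringBuilder out = new StringBuilder(in.length);
        int i = 0;
        while (i < in.length) {
            int a = in[i] & 0xFF;
            if ((a & 0x80) == 0x00) {                  // 0xxxxxxx: one-byte group
                out.append((char) a);
                i += 1;
            } else if ((a & 0xE0) == 0xC0) {           // 110xxxxx: two-byte group
                if (i + 1 >= in.length) {
                    throw new UTFDataFormatException("missing byte b");
                }
                int b = in[i + 1] & 0xFF;
                if ((b & 0xC0) != 0x80) {
                    throw new UTFDataFormatException("byte b is not 10xxxxxx");
                }
                out.append((char) (((a & 0x1F) << 6) | (b & 0x3F)));
                i += 2;
            } else if ((a & 0xF0) == 0xE0) {           // 1110xxxx: three-byte group
                if (i + 2 >= in.length) {
                    throw new UTFDataFormatException("missing byte b or c");
                }
                int b = in[i + 1] & 0xFF;
                int c = in[i + 2] & 0xFF;
                if ((b & 0xC0) != 0x80 || (c & 0xC0) != 0x80) {
                    throw new UTFDataFormatException("byte b or c is not 10xxxxxx");
                }
                out.append((char) (((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F)));
                i += 3;
            } else {                                   // 1111xxxx or 10xxxxxx
                throw new UTFDataFormatException("invalid first byte of group");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws UTFDataFormatException {
        // C0 80 is the two-byte form of the null character.
        String s = decode(new byte[] { 'a', (byte) 0xC0, (byte) 0x80, 'b' });
        System.out.println(s.length());        // 3
        System.out.println((int) s.charAt(1)); // 0
    }
}
```

Because there is no branch for 1111xxxx, a standard 4-byte UTF-8 sequence such as F0 9F 98 80 is rejected by this sketch, which matches the rule quoted above.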

The writeUTF method of interface DataOutput may be used to write data that is suitable for reading by this method.
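A quick way to see the two methods working together is a round trip through an in-memory buffer. The sketch below uses the standard DataOutputStream and DataInputStream classes; the wrapper class UtfRoundTrip and its roundTrip method are names invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class UtfRoundTrip {
    // Hypothetical helper: writes s with writeUTF and reads it back with readUTF.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buf)) {
                out.writeUTF(s);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
                return in.readUTF();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Contains an embedded null and a supplementary character (U+1F600).
        String original = "a\u0000b" + new String(Character.toChars(0x1F600));
        System.out.println(original.equals(roundTrip(original))); // true
    }
}
```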

+7

The Java programming language, which uses UTF-16 for its internal text representation, supports a non-standard modification of UTF-8 for string serialization. This encoding is called modified UTF-8. There are two differences between modified and standard UTF-8. The first is that the null character (U+0000) is encoded with two bytes instead of one, namely as 11000000 10000000. The second is that characters outside the Basic Multilingual Plane are encoded as surrogate pairs, with each surrogate written as its own 3-byte sequence, instead of using the 4-byte format of standard UTF-8.
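To show where 11000000 10000000 comes from, here is a minimal sketch of the per-character encoding rules as I understand them from the DataOutput#writeUTF documentation. The class ModifiedUtf8CharEncoder and its encode method are hypothetical; the real writeUTF also emits a length prefix and works on whole strings.

```java
class ModifiedUtf8CharEncoder {
    // Hypothetical helper: encodes a single UTF-16 code unit in modified UTF-8.
    static byte[] encode(char c) {
        if (c >= 0x0001 && c <= 0x007F) {
            // One byte: 0xxxxxxx. U+0000 is deliberately excluded from this range.
            return new byte[] { (byte) c };
        } else if (c <= 0x07FF) {
            // Two bytes: 110xxxxx 10xxxxxx. This branch also handles U+0000.
            return new byte[] {
                (byte) (0xC0 | (0x1F & (c >> 6))),
                (byte) (0x80 | (0x3F & c))
            };
        } else {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx. Surrogates land here too.
            return new byte[] {
                (byte) (0xE0 | (0x0F & (c >> 12))),
                (byte) (0x80 | (0x3F & (c >> 6))),
                (byte) (0x80 | (0x3F & c))
            };
        }
    }

    public static void main(String[] args) {
        byte[] nul = encode('\u0000');
        System.out.println(Integer.toBinaryString(nul[0] & 0xFF) + " "
                + Integer.toBinaryString(nul[1] & 0xFF)); // 11000000 10000000
    }
}
```

Since a supplementary character is two code units in a Java String, calling encode on each half of the pair yields the six-byte form.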

+2

Perhaps this is: http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

"In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter. However, it uses modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constant strings in class files."
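One practical consequence is that the two formats are not interchangeable. As a rough check, the sketch below feeds byte sequences to a strict standard UTF-8 decoder; on current JDKs, as far as I know, it rejects both the two-byte null and the encoded surrogate pair. The class StrictUtf8Check and its method are hypothetical names; CharsetDecoder is the real JDK API, and a decoder obtained from newDecoder() reports malformed input by default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // Hypothetical helper: true if the bytes are well-formed standard UTF-8.
    static boolean isStandardUtf8(byte[] bytes) {
        try {
            // A fresh decoder throws on malformed input instead of substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] standardNul = { 0x00 };
        byte[] modifiedNul = { (byte) 0xC0, (byte) 0x80 };
        byte[] standardSupplementary = { (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80 };
        byte[] modifiedSupplementary = {
            (byte) 0xED, (byte) 0xA0, (byte) 0xBD, (byte) 0xED, (byte) 0xB8, (byte) 0x80
        };

        System.out.println(isStandardUtf8(standardNul));
        System.out.println(isStandardUtf8(modifiedNul));
        System.out.println(isStandardUtf8(standardSupplementary));
        System.out.println(isStandardUtf8(modifiedSupplementary));
    }
}
```

So data written with writeUTF or obtained through JNI should be read back with the matching modified UTF-8 routines rather than with a plain UTF-8 reader.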

+1

Source: https://habr.com/ru/post/900232/

