What is a good terminator byte for UTF-8 data?

I need to manipulate UTF-8 byte arrays in a low-level environment. The strings will share prefixes and are stored in a container that exploits this (a trie). To preserve as much of this prefix similarity as possible, I would prefer a terminator at the end of my byte arrays over (say) a byte-length prefix.

Which terminator should I use? It seems that 0xff is an illegal byte in every position of any UTF-8 string, but maybe someone knows for certain?

+6
3 answers

The 0xff byte cannot appear in a valid UTF-8 sequence, nor can any of 0xfc, 0xfd, or 0xfe.

All UTF-8 bytes must match one of:

 0xxxxxxx - Lower 7 bits.
 10xxxxxx - Second and subsequent bytes in a multi-byte sequence.
 110xxxxx - First byte of a two-byte sequence.
 1110xxxx - First byte of a three-byte sequence.
 11110xxx - First byte of a four-byte sequence.
 111110xx - First byte of a five-byte sequence.
 1111110x - First byte of a six-byte sequence.

There are no sequences of seven or more bytes. The latest version of UTF-8 allows only sequences up to 4 bytes long, which leaves 0xf8-0xff unused, but a byte sequence could still legitimately be called UTF-8 under an obsolete version of the standard and include octets in 0xf8-0xfd.
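The bit patterns above can be checked with a few masks. The following is only a minimal sketch: the function name `classify_byte` and its return labels are made up for illustration, and the five- and six-byte lead bytes are labeled obsolete on the assumption that the data follows the current four-byte limit.

```python
def classify_byte(b: int) -> str:
    """Classify a single byte (0-255) by its high bits.

    Hypothetical helper for illustration; the labels are not standard names.
    """
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if (b & 0x80) == 0x00:
        return "ascii"              # 0xxxxxxx
    if (b & 0xC0) == 0x80:
        return "continuation"       # 10xxxxxx
    if (b & 0xE0) == 0xC0:
        return "lead-2"             # 110xxxxx
    if (b & 0xF0) == 0xE0:
        return "lead-3"             # 1110xxxx
    if (b & 0xF8) == 0xF0:
        return "lead-4"             # 11110xxx
    if (b & 0xFC) == 0xF8:
        return "lead-5 (obsolete)"  # 111110xx, 0xF8-0xFB
    if (b & 0xFE) == 0xFC:
        return "lead-6 (obsolete)"  # 1111110x, 0xFC-0xFD
    return "never used"             # 0xFE and 0xFF match no pattern
```

Only 0xFE and 0xFF fall through every pattern, which is why they are safe as terminators even against data produced by an older encoder.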

+4

0xFF and 0xFE cannot appear in legitimate UTF-8 data. Also, bytes 0xF8-0xFD appear only in the obsolete version of UTF-8, which allows sequences of up to six bytes.

0x00 is legal, but it appears nowhere except as the encoding of U+0000. This is exactly the same as in other encodings, and the fact that it is legal in all of them has never stopped it from being used as a terminator in C strings. I would probably go with 0x00.
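To make the trade-off concrete, here is a rough sketch of appending a terminator and of why it keeps shared prefixes intact where a length prefix does not. The helpers `terminate`, `length_prefixed`, and `common_prefix_len` are hypothetical names introduced for this example, and the one-byte length prefix is an assumption chosen only to keep the comparison short.

```python
def terminate(text: str, terminator: int = 0x00) -> bytes:
    """Encode text as UTF-8 and append one terminator byte.

    Rejects input whose encoding already contains the terminator,
    which for 0x00 means any string containing U+0000.
    """
    data = text.encode("utf-8")
    if terminator in data:
        raise ValueError("terminator byte occurs in the encoded data")
    return data + bytes([terminator])


def length_prefixed(text: str) -> bytes:
    """Encode text with a one-byte length prefix (assumed layout, max 255 bytes)."""
    data = text.encode("utf-8")
    if len(data) > 0xFF:
        raise ValueError("too long for a one-byte length prefix")
    return bytes([len(data)]) + data


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Count the leading bytes that a and b share, as a trie would."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

With these helpers, `terminate("car")` and `terminate("cart")` share their first three bytes, while the length-prefixed forms differ at the very first byte. A string containing U+0000 is rejected with the default terminator but is accepted with `terminator=0xFF`.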

+5

How about using one of the UTF-8 control characters?

You can choose one from http://www.utf8-chartable.de/

0

Source: https://habr.com/ru/post/906336/

