Can UTF-8 contain null bytes?

Question

Can a UTF-8 string contain zero bytes? I'm going to send it over an ASCII plain-text protocol; should I encode it with something like Base64?

+46

unicode

einclude Aug 02 2018-11-11T00:

3 answers

A UTF-8 encoded string can have most values from 0x00 to 0xFF in a given byte of its backing memory (although some specific combinations are not allowed, see http://en.wikipedia.org/wiki/UTF-8 , and the octet values C0, C1, and F5-FF never appear).
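The claim about unused octets can be checked by brute force. Here is a rough sketch in Python (the function name `unused_utf8_byte_values` is made up for illustration): it encodes every Unicode scalar value, meaning every code point except the surrogates, and lists the byte values that never show up.

```python
def unused_utf8_byte_values():
    """Return the sorted byte values that never occur in valid UTF-8."""
    # All Unicode scalar values: every code point except the surrogates,
    # which cannot be encoded as UTF-8.
    text = "".join(
        chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF
    )
    seen = set(text.encode("utf-8"))
    return [value for value in range(256) if value not in seen]


if __name__ == "__main__":
    print(" ".join(f"{value:02X}" for value in unused_utf8_byte_values()))
```

Note that 0x00 is not in that list: it does occur, as the encoding of U+0000.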

If you are transporting the string across a channel that does not support binary data, such as an ASCII stream, you need to encode it accordingly. Base64 is widely supported and will certainly solve the problem, although it is not fully space-efficient: it uses a 64-character alphabet to encode data, while ASCII offers 128 code values.

There is a SourceForge project that provides basE91 encoding, which is more space-efficient while still avoiding non-printable characters: http://base91.sourceforge.net/
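For the Base64 route, a minimal sketch using Python's standard `base64` module might look like this. The helper names `to_ascii_safe` and `from_ascii_safe` are hypothetical, and the sketch assumes the channel passes the Base64 alphabet (letters, digits, `+`, `/`, `=`) through untouched.

```python
import base64


def to_ascii_safe(text: str) -> str:
    """Encode text as UTF-8, then as Base64, giving a pure-ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_ascii_safe(payload: str) -> str:
    """Reverse to_ascii_safe: Base64-decode, then decode the UTF-8 bytes."""
    return base64.b64decode(payload.encode("ascii")).decode("utf-8")


if __name__ == "__main__":
    original = "caf\u00e9\x00end"  # contains a non-ASCII char and a NUL
    wire = to_ascii_safe(original)
    print(wire)
    print(from_ascii_safe(wire) == original)
```

The cost is size: every 3 input bytes become 4 output characters.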

+3

Eric J. Aug 02 2018-11-11T00:

ASCII text is limited to byte values between 0 and 127. UTF-8 text does not have this limitation: text encoded as UTF-8 can have the high bit set in its bytes. It is therefore unsafe to send UTF-8 text over a channel that does not guarantee safe passage of that high bit.

If you have to deal with an ASCII-only channel, Base64 is a reasonable choice (though not particularly economical). Are you sure you are limited to 7-bit data? That is somewhat unusual these days.
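To find out whether a particular string would actually need that high bit, something along these lines should do. It is only a sketch, and `needs_8bit_channel` is a name invented here.

```python
def needs_8bit_channel(text: str) -> bool:
    """True if the UTF-8 encoding of text has any byte with the high bit set."""
    return any(byte & 0x80 for byte in text.encode("utf-8"))


if __name__ == "__main__":
    for sample in ("plain ASCII", "caf\u00e9"):
        print(repr(sample), needs_8bit_channel(sample))
```

Text for which this returns False is byte-for-byte identical in ASCII and UTF-8, so it can go over a 7-bit channel unchanged (NUL handling aside).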

+2

Michael Petrotta Aug 02 2018-11-11T00:

paxdiablo · Accepted Answer · 2011-08-02 04:41

Yes, the zero byte in UTF-8 is code point 0, NUL. No other Unicode code point is encoded in UTF-8 with a zero byte anywhere in it.
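If you would rather verify that than take it on trust, here is an exhaustive check sketched in Python (`code_points_with_zero_byte` is an illustrative name, not a library function). It walks every encodable code point and collects those whose UTF-8 form contains a 0x00 byte.

```python
def code_points_with_zero_byte():
    """Return every code point whose UTF-8 encoding contains a 0x00 byte."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        if 0 in chr(cp).encode("utf-8"):
            hits.append(cp)
    return hits


if __name__ == "__main__":
    print(code_points_with_zero_byte())
```

The only entry that comes back is code point 0 itself.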

Possible code points and their UTF8 encoding:

Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx

You can see that all non-zero ASCII characters are represented as themselves, while all multibyte sequences have a high bit of 1 in every one of their bytes.
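The bit layout in the table can be turned into code fairly directly. The following is a hand-rolled sketch for illustration only (in practice you would just call `str.encode("utf-8")`); the function name `utf8_encode_code_point` is my own.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point by following the bit patterns in the table."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp!r}")
    if cp <= 0x7F:
        # 0xxxxxxx: ASCII is stored as itself.
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    for cp in (0x00, 0x41, 0xE9, 0x20AC, 0x1F600):
        print(f"U+{cp:06X}", utf8_encode_code_point(cp).hex(" "))
```

Every byte produced by the last three branches is OR-ed with at least 0x80, which is why a zero byte can only come out of the first branch, and only for code point 0.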

You may need to be careful that your ASCII plain-text protocol does not treat non-ASCII characters badly (since that will be every code point outside the ASCII range).
