Byte array conversion → string → byte array distorts data

Can someone tell me what is going on here?

byte[] stamp = new byte[] { 0, 0, 0, 0, 0, 1, 177, 115 };
string serialize = System.Text.Encoding.UTF8.GetString(stamp);
byte[] deserialize = System.Text.Encoding.UTF8.GetBytes(serialize);
// deserialize == byte[] { 0, 0, 0, 0, 0, 1, 239, 191, 189, 115 }

Why is stamp != deserialize?

2 answers

In the original byte array, you have the byte 177, which is the plus-minus sign (in Latin-1). However, during the decode step this byte is not recognized as valid UTF-8. It is replaced by 239 191 189, which is the UTF-8 encoding of the REPLACEMENT CHARACTER (U+FFFD).

Here is a chart for reference. http://www.utf8-chartable.de/unicode-utf8-table.pl?start=65280&utf8=dec

I'm not quite sure why the plus-minus sign is not recognized, but that is why the byte arrays are not equal. Apart from this substitution, they would be equal and the data would not be corrupted in any way.
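If you want to see the substitution in isolation, here is a minimal sketch using the standard System.Text.Encoding.UTF8 API (the class and variable names are just for illustration):

using System;
using System.Text;

class ReplacementCharacterDemo
{
    static void Main()
    {
        // Byte 177 is not valid on its own in UTF-8, so the decoder
        // substitutes U+FFFD, the Unicode replacement character.
        string decoded = Encoding.UTF8.GetString(new byte[] { 177 });
        Console.WriteLine((int)decoded[0]);             // 65533 (0xFFFD)

        // Re-encoding U+FFFD produces the three bytes 239 191 189,
        // which is where the extra bytes in the question come from.
        byte[] reencoded = Encoding.UTF8.GetBytes(decoded);
        Console.WriteLine(string.Join(",", reencoded)); // 239,191,189
    }
}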


The byte array does not encode a valid text string in UTF-8, so when you "serialize" it, the parts that cannot be recognized are replaced with a "replacement character". If you must convert byte arrays to strings, you should pick an encoding without such restrictions, such as ISO-8859-1 (Latin-1), where every byte value maps to a character.
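A quick sketch of the Latin-1 round trip, assuming the standard Encoding.GetEncoding("ISO-8859-1") lookup (names here are illustrative):

using System;
using System.Text;

class Latin1RoundTrip
{
    static void Main()
    {
        byte[] stamp = new byte[] { 0, 0, 0, 0, 0, 1, 177, 115 };

        // ISO-8859-1 (Latin-1) maps every byte value 0-255 to a character,
        // so decoding and re-encoding cannot lose or substitute any bytes.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string text = latin1.GetString(stamp);
        byte[] roundTripped = latin1.GetBytes(text);

        Console.WriteLine(string.Join(",", roundTripped)); // 0,0,0,0,0,1,177,115
    }
}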

In particular, byte 177 cannot appear on its own in real UTF-8: bytes in the range 128-191 are "continuation bytes" that can only appear after a lead byte in the range 194-244 has been seen. You can learn more about UTF-8 here: https://en.wikipedia.org/wiki/UTF-8
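For comparison, a small sketch of a valid sequence: 194 is a lead byte announcing one continuation byte, so 194 followed by 177 decodes cleanly to the plus-minus sign, whereas 177 alone does not (names are just for illustration):

using System;
using System.Text;

class ValidSequenceDemo
{
    static void Main()
    {
        // 194 (0xC2) is a lead byte, 177 (0xB1) its continuation byte;
        // together they are the UTF-8 encoding of U+00B1 '±'.
        string plusMinus = Encoding.UTF8.GetString(new byte[] { 194, 177 });
        Console.WriteLine(plusMinus); // ±

        // 177 with no preceding lead byte is invalid UTF-8 and is
        // replaced with U+FFFD during decoding.
        string invalid = Encoding.UTF8.GetString(new byte[] { 177 });
        Console.WriteLine((int)invalid[0]); // 65533
    }
}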


Source: https://habr.com/ru/post/1493228/

