How do you read UTF-8 characters from an infinite byte stream - C#

Typically, StreamReader is used to read characters from a byte stream. In this example, I am reading records delimited by '\r' from an infinite stream.

    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        var messageBuilder = new StringBuilder();
        var nextChar = 'x';
        while (reader.Peek() >= 0)
        {
            nextChar = (char)reader.Read();
            messageBuilder.Append(nextChar);
            if (nextChar == '\r')
            {
                // End of record: hand the accumulated text over and start a new one.
                ProcessBuffer(messageBuilder.ToString());
                messageBuilder.Clear();
            }
        }
    }

The problem is that StreamReader has a small internal buffer, so code waiting for the end-of-record delimiter ('\r' in this case) has to wait until StreamReader's internal buffer is flushed (usually because more bytes have arrived).

This alternative implementation works for single-byte UTF-8 characters, but fails on multi-byte characters.

    int byteAsInt = 0;
    var messageBuilder = new StringBuilder();
    while ((byteAsInt = stream.ReadByte()) != -1)
    {
        // Each byte is decoded in isolation, so a multi-byte sequence is never seen whole.
        var nextChar = Encoding.UTF8.GetChars(new[] { (byte)byteAsInt });
        Console.Write(nextChar[0]);
        messageBuilder.Append(nextChar);
        if (nextChar[0] == '\r')
        {
            ProcessBuffer(messageBuilder.ToString());
            messageBuilder.Clear();
        }
    }

How can I change this code to work with multi-byte characters?

+6
3 answers

Instead of Encoding.UTF8.GetChars, which is designed to convert complete buffers, get a Decoder instance and call its GetChars method. It uses the Decoder's internal buffer to carry partial multi-byte sequences over from the end of one call to the next.
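
To make the difference concrete, here is a minimal sketch that decodes the same bytes one at a time in both ways. The class and method names (DecoderDemo, DecodeStateless, DecodeStateful) are made up for illustration; only Encoding.UTF8.GetChars, Encoding.UTF8.GetDecoder and Decoder.GetChars come from the framework.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the question's code.
public static class DecoderDemo
{
    // Decodes every byte in isolation with the stateless Encoding.UTF8.GetChars.
    // A lone lead or continuation byte is not valid UTF-8 on its own, so
    // multi-byte characters come out as replacement characters (U+FFFD).
    public static string DecodeStateless(byte[] bytes)
    {
        var sb = new StringBuilder();
        foreach (var b in bytes)
        {
            sb.Append(Encoding.UTF8.GetChars(new[] { b }));
        }
        return sb.ToString();
    }

    // Feeds the same bytes one at a time through a single Decoder instance.
    // The decoder remembers incomplete sequences between calls and returns
    // 0 chars until a full character has arrived.
    public static string DecodeStateful(byte[] bytes)
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var oneByte = new byte[1];
        var chars = new char[2]; // room for a surrogate pair
        var sb = new StringBuilder();
        foreach (var b in bytes)
        {
            oneByte[0] = b;
            int count = decoder.GetChars(oneByte, 0, 1, chars, 0);
            sb.Append(chars, 0, count);
        }
        return sb.ToString();
    }
}
```

Calling DecodeStateful on the UTF-8 bytes of a string should give the string back unchanged, while DecodeStateless should only round-trip pure ASCII input.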

+9

Thanks to Richard, I now have a working reader for an infinite stream. As he explained, the trick is to use a Decoder instance and call its GetChars method. I tested it with multi-byte Japanese text and it works great.

    int byteAsInt;
    var messageBuilder = new StringBuilder();
    var decoder = Encoding.UTF8.GetDecoder();
    var oneByte = new byte[1];
    var nextChars = new char[2]; // two slots: a 4-byte sequence decodes to a surrogate pair
    while ((byteAsInt = stream.ReadByte()) != -1)
    {
        oneByte[0] = (byte)byteAsInt;
        // The decoder holds incomplete sequences and returns 0 until a character is complete.
        var charCount = decoder.GetChars(oneByte, 0, 1, nextChars, 0);
        if (charCount == 0) continue;
        Console.Write(nextChars, 0, charCount);
        messageBuilder.Append(nextChars, 0, charCount);
        if (nextChars[0] == '\r')
        {
            ProcessBuffer(messageBuilder.ToString());
            messageBuilder.Clear();
        }
    }

+5

I do not understand why you are not using the ReadLine method to read the stream. However, if there is a good reason for that, it still seems to me that repeatedly calling GetChars on the decoder is inefficient. Why not use the fact that the byte representation of '\r' cannot be part of a multi-byte sequence? (Every byte in a multi-byte sequence must be greater than 127, i.e., have its most significant bit set.)

    var messageBuilder = new List<byte>();
    int byteAsInt;
    while ((byteAsInt = stream.ReadByte()) != -1)
    {
        messageBuilder.Add((byte)byteAsInt);
        if (byteAsInt == '\r')
        {
            // 0x0D can only be a real '\r', so the buffered bytes form complete characters.
            var messageString = Encoding.UTF8.GetString(messageBuilder.ToArray());
            Console.Write(messageString);
            ProcessBuffer(messageString);
            messageBuilder.Clear();
        }
    }

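
The same idea can be taken a step further by reading whatever bytes are available in one call instead of one byte at a time. The following is only a sketch under assumptions: RecordReader, ReadRecords and the bufferSize parameter are hypothetical names, and it assumes the stream's Read returns as soon as some bytes are available (as NetworkStream does) rather than waiting to fill the buffer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the answer above.
public static class RecordReader
{
    // Yields '\r'-terminated UTF-8 records (terminator included) from a stream.
    public static IEnumerable<string> ReadRecords(Stream stream, int bufferSize = 4096)
    {
        var buffer = new byte[bufferSize];
        var pending = new MemoryStream(); // bytes of the record in progress
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            int start = 0;
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] != (byte)'\r') continue;

                // 0x0D never occurs inside a multi-byte UTF-8 sequence, so the
                // bytes collected up to here are a whole number of characters.
                pending.Write(buffer, start, i - start + 1);
                yield return Encoding.UTF8.GetString(pending.ToArray());
                pending.SetLength(0);
                start = i + 1;
            }
            // Keep the unfinished tail until its terminator arrives.
            pending.Write(buffer, start, read - start);
        }
        // Bytes after the last '\r' are dropped at end of stream,
        // matching the byte-at-a-time loop above.
    }
}
```

A caller would then write something like foreach (var record in RecordReader.ReadRecords(stream)) ProcessBuffer(record); where ProcessBuffer is the question's own method.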
+1

Source: https://habr.com/ru/post/921412/
