Parsing a UTF-8 JSON response from the server

I ran into a strange problem while parsing a JSON response from my server. For the past few months it worked fine when the response arrived with Content-Type: text/html and was read as follows:

    string response = "";
    using (var client = new System.Net.Http.HttpClient())
    {
        var postData = new System.Net.Http.FormUrlEncodedContent(data);
        var clientResult = await client.PostAsync(url, postData);
        if (clientResult.IsSuccessStatusCode)
        {
            response = await clientResult.Content.ReadAsStringAsync();
        }
    }
    // Parse the response to a JObject...

But when the response arrives with Content-Type: text/html; charset=utf8, it throws an exception saying that the character set in the Content-Type is invalid.

Exception message: The character set provided in ContentType is invalid. Cannot read content as string using an invalid character set.

So, I changed this:

 response = await clientResult.Content.ReadAsStringAsync(); 

to this:

    var raw_response = await clientResult.Content.ReadAsByteArrayAsync();
    response = Encoding.UTF8.GetString(raw_response, 0, raw_response.Length);

Now I can read the response without any exception, but parsing it throws a parsing exception. While debugging, I got this (I switched the response to a shorter one for testing):

    var r1 = await clientResult.Content.ReadAsStringAsync();
    var r2 = Encoding.UTF8.GetString(await clientResult.Content.ReadAsByteArrayAsync(), 0, raw_response.Length);
    System.Diagnostics.Debug.WriteLine("Length: {0} - {1}", r1.Length, r1);
    System.Diagnostics.Debug.WriteLine("Length: {0} - {1}", r2.Length, r2);

    // Output
    Length: 38 - {"version":1,"specialword":"C\u00e3o"}
    Length: 39 - {"version":1,"specialword":"C\u00e3o"}

The JSON looks correct in both cases, but the lengths differ, and I could not understand why. When I copied the output into Notepad++ to look for hidden characters, a ? appeared at the start of the second string:

    Length: 38 - {"version":1,"specialword":"C\u00e3o"}
    Length: 39 - ?{"version":1,"specialword":"C\u00e3o"}

This ? is clearly what causes the parsing exception, but I don't know why Encoding.UTF8.GetString produces it.

I have been struggling with this for the last few hours, and I really need help.

1 answer

Well, I'm surprised that you got this behavior; I would have expected Encoding.UTF8.GetString to handle this for you.

What you are seeing, the character value 0xFEFF, is a byte order mark ("BOM"). A BOM is not needed in UTF-8 because the byte order does not vary, but it is sometimes used as a marker signaling that the text that follows is encoded as UTF-8. (The actual byte sequence is EF BB BF, but when it is decoded as UTF-8, it becomes the code point U+FEFF.)
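To make the extra character visible, here is a minimal sketch. The class name BomDemo and the five-byte sample payload are made up for illustration; only Encoding.UTF8.GetString comes from the question.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Decodes the bytes as UTF-8 with no BOM handling, as in the question.
    public static string Decode(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes, 0, bytes.Length);
    }

    public static void Main()
    {
        // Hypothetical payload: EF BB BF (the UTF-8 BOM) followed by the bytes of "{}".
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x7B, 0x7D };
        string text = Decode(withBom);

        // The three BOM bytes decode to a single char, so the length is 3, not 2.
        Console.WriteLine(text.Length);
        // Prints the code point of the first char in hex: FEFF.
        Console.WriteLine(((int)text[0]).ToString("X4"));
    }
}
```

That single leading char would account for the 38 versus 39 difference in the question's output.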

If you create your own UTF8Encoding instance, you can specify whether to include or exclude the BOM. (I think I'm wrong about that; it may only control whether the BOM is emitted when encoding.)
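A quick way to check that doubt is to compare both constructor settings. This is only a sketch: the class name PreambleDemo and the sample bytes are invented here. As far as I know, the encoderShouldEmitUTF8Identifier flag changes what GetPreamble returns, while GetString decodes the BOM bytes to U+FEFF either way.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Number of BOM bytes this encoding offers to writers via GetPreamble().
    public static int PreambleLength(bool emitBom)
    {
        return new UTF8Encoding(emitBom).GetPreamble().Length;
    }

    // True if decoding BOM-prefixed bytes leaves U+FEFF at the start of the string.
    public static bool DecodedStartsWithBom(bool emitBom)
    {
        // Hypothetical payload: UTF-8 BOM followed by the bytes of "{}".
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x7B, 0x7D };
        string text = new UTF8Encoding(emitBom).GetString(withBom, 0, withBom.Length);
        return text.Length > 0 && text[0] == '\uFEFF';
    }

    public static void Main()
    {
        foreach (bool emitBom in new[] { true, false })
        {
            Console.WriteLine("emitBom={0}: preamble bytes={1}, decoded string starts with BOM={2}",
                emitBom, PreambleLength(emitBom), DecodedStartsWithBom(emitBom));
        }
    }
}
```

If that holds, switching to new UTF8Encoding(false) would not by itself remove the character when decoding.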

Alternatively, you can check for the BOM explicitly and strip it if it is present, for example:

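    var raw_response = await clientResult.Content.ReadAsByteArrayAsync();
    var r2 = Encoding.UTF8.GetString(raw_response, 0, raw_response.Length);
    // Drop the leading BOM (U+FEFF) if the decoder left one in place.
    if (r2.Length > 0 && r2[0] == '\uFEFF')
    {
        r2 = r2.Substring(1);
    }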
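Another option, which the answer above does not mention, is to decode through a StreamReader, since it consumes a leading UTF-8 BOM while reading. The helper name BomSafeReader.DecodeUtf8 is hypothetical, and this is a sketch rather than a drop-in fix.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomSafeReader
{
    // Decodes UTF-8 bytes; StreamReader skips a leading BOM if one is present.
    public static string DecodeUtf8(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Hypothetical payloads: the bytes of "{}" with and without a UTF-8 BOM.
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x7B, 0x7D };
        byte[] withoutBom = { 0x7B, 0x7D };

        // Both calls should yield the same two-character string.
        Console.WriteLine(DecodeUtf8(withBom).Length);
        Console.WriteLine(DecodeUtf8(withoutBom).Length);
    }
}
```

In the question's code this would be called as DecodeUtf8(await clientResult.Content.ReadAsByteArrayAsync()), which avoids both the charset check in ReadAsStringAsync and the manual Substring call.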

Source: https://habr.com/ru/post/1495189/

