How do I correctly handle UTF-8 in web responses in my C# code?

To preface this: most of what I know about text encoding I learned from Joel Spolsky's article.

I am currently writing a C# web application that runs a query against our Google Search Appliance (GSA), reads the results, and presents them to the user in our own user interface. However, I run into encoding problems when I show the text summaries to users.

When I query the GSA directly in Chrome, IE, or any other browser, I get the following response:

Notes after publication No. 8 Seed DePaul vs. No. 9 Seed USF Game 6 – Second Round

In my C# code, I read this response with the following code:

    var request = WebRequest.Create(LastQueryUrl);
    var response = (HttpWebResponse)request.GetResponse();
    if (response.StatusCode != HttpStatusCode.OK)
        return null;
    using (var reader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8))
        content = reader.ReadToEnd();

When I debug the content variable, I see that the string is converted to:

USF Game 6 Second

I am 99% sure that the data coming from the GSA is UTF-8, because other parts of their XML say so, as do various tidbits in their documentation. However, if I read the stream using System.Text.Encoding.Unicode, none of the text is readable.

What am I doing wrong, and how can I display the text correctly?


Edit: using System.Text.Encoding.GetEncoding("ISO-8859-1") gives me

USF Game 6 Second

The question mark is gone, but the dash is still not displayed.

+6
2 answers

Could you try running this code (in place of your using block) and paste the result? I assume you're on .NET 4.

    using (var responseStream = response.GetResponseStream())
    using (var memoryStream = new MemoryStream())
    {
        // Dump the raw response bytes as hyphen-separated hex, e.g. "20-96-20".
        responseStream.CopyTo(memoryStream);
        byte[] bytes = memoryStream.ToArray();
        content = BitConverter.ToString(bytes);
    }

Edit: I noticed that you did not paste the entire returned string in your post. Is that because the rest of the response contains sensitive data? If so, do not paste the output of the code suggested above.

Edit 2: To get the correct result, you can use Encoding.GetEncoding(1252); however, I would suggest that you don't, for reasons that I will explain shortly.

Explanation: From what I understand, the problem is that the sending side does not declare its encoding correctly. You say their documentation claims UTF-8, which clearly contradicts their XML declaration of ISO-8859-1. In fact, the encoding actually used is neither of the two.

In the hexadecimal string that you posted, the culprit character has a byte value of 0x96 and occurs in the middle of the sequence 20-96-20. In both UTF-8 and ISO-8859-1 (as well as ASCII before them), 0x20 is the space character. However, in UTF-8, 0x96 is a continuation byte and is invalid unless it is preceded by a lead byte (which 0x20 is not). In ISO-8859-1, 0x96 is a C1 control character and is therefore not printable (it cannot be displayed to users).

Thus, we can conclude that the source character encoding is neither UTF-8 nor ISO-8859-1, but Windows-1252, sometimes considered a "superset" of ISO-8859-1 because it replaces the control characters in the range 0x80-0x9F with printable characters. Indeed, in Windows-1252, 0x96 is the en dash you were expecting.
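
To make the byte-level argument concrete, here is a minimal sketch that decodes the 20-96-20 sequence under all three encodings. It assumes a modern .NET SDK (C# 9 top-level statements), and the variable names are only illustrative. The RegisterProvider call is needed on .NET Core / .NET 5+ only, where code page 1252 is not available by default; on .NET Framework it can be dropped.

```csharp
using System;
using System.Text;

// Assumption: running on .NET Core / .NET 5+, where code page 1252
// must be registered before Encoding.GetEncoding(1252) works.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// The 20-96-20 sequence from the hex dump: space, culprit byte, space.
byte[] bytes = { 0x20, 0x96, 0x20 };

string asUtf8 = Encoding.UTF8.GetString(bytes);
string asLatin1 = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
string as1252 = Encoding.GetEncoding(1252).GetString(bytes);

// Print the code point that the middle byte decodes to in each case.
Console.WriteLine("UTF-8:        U+{0:X4}", (int)asUtf8[1]);
Console.WriteLine("ISO-8859-1:   U+{0:X4}", (int)asLatin1[1]);
Console.WriteLine("Windows-1252: U+{0:X4}", (int)as1252[1]);
```

The middle character should come out as U+FFFD (the replacement character, usually rendered as a question mark), U+0096 (an invisible C1 control character), and U+2013 (the en dash), which matches the three results seen in the question.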

Given the above, it may be safe to fix your problem by assuming the encoding is Windows-1252; however, if I were you, I would contact the provider and inform them of this defect.

    using (var stream = response.GetResponseStream())
    using (var reader = new StreamReader(stream, System.Text.Encoding.GetEncoding(1252)))
        content = reader.ReadToEnd();
+2

The HTML5 specification requires that documents advertised as ISO-8859-1 be parsed using Windows-1252 encoding.
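
If you would rather follow that rule than hard-code 1252 at the call site, one possible sketch is a small helper that maps the declared charset label to a decoder. ResolveDeclaredEncoding is a hypothetical name, not a framework API, and the sketch assumes a modern .NET SDK (top-level statements; the RegisterProvider call is only needed on .NET Core / .NET 5+).

```csharp
using System;
using System.Text;

// Assumption: .NET Core / .NET 5+, where code page 1252 must be registered first.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: picks the decoder for a declared charset label,
// treating ISO-8859-1 (and its alias latin1) as Windows-1252.
static Encoding ResolveDeclaredEncoding(string label)
{
    string normalized = (label ?? "").Trim().ToLowerInvariant();
    if (normalized == "iso-8859-1" || normalized == "latin1")
        return Encoding.GetEncoding(1252);

    // Any other label is looked up as-is; unknown names throw ArgumentException.
    return Encoding.GetEncoding(normalized);
}

Encoding encoding = ResolveDeclaredEncoding("ISO-8859-1");
Console.WriteLine(encoding.CodePage);
```

The resulting Encoding can then be passed to the StreamReader constructor in place of a fixed encoding.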

+1

Source: https://habr.com/ru/post/912251/

