FileUpload Server Controls and Unicode Characters

I use the FileUpload server control to load an HTML document that was previously saved from MS Word (as a filtered web page). The encoding is windows-1252. The document contains smart (curly) quotes as well as regular quotes. It also has a few characters that look like spaces but, on closer inspection, are neither a regular TAB nor a SPACE.

When I read the file's contents with a StreamReader, these special characters are turned into question marks. I assume this is because the encoding defaults to UTF-8 while the file is Unicode.

So I went ahead and created a StreamReader with Unicode encoding, and then replaced all the unwanted characters with plain ones (using code I actually found on Stack Overflow). This seems to work... but I can't convert the string back to UTF-8 to display it in an asp:Literal. The code is below and it should work... but the output of ConvertToASCII is unreadable.

Take a look below:

    protected void btnUpload_Click(object sender, EventArgs e)
    {
        StreamReader sreader;
        if (uplSOWDoc.HasFile)
        {
            try
            {
                if (uplSOWDoc.PostedFile.ContentType == "text/html" ||
                    uplSOWDoc.PostedFile.ContentType == "text/plain")
                {
                    sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.Unicode);
                    string sowText = sreader.ReadToEnd();
                    sowLiteral.Text = ConvertToASCII(sowText);
                    lblUploadResults.Text = "File loaded successfully.";
                }
                else
                    lblUploadResults.Text = "Upload failed. Just text or html files are allowed.";
            }
            catch (Exception ex)
            {
                lblUploadResults.Text = ex.Message;
            }
        }
    }

    private string ConvertToASCII(string source)
    {
        if (source.IndexOf('\u2013') > -1) source = source.Replace('\u2013', '-');
        if (source.IndexOf('\u2014') > -1) source = source.Replace('\u2014', '-');
        if (source.IndexOf('\u2015') > -1) source = source.Replace('\u2015', '-');
        if (source.IndexOf('\u2017') > -1) source = source.Replace('\u2017', '_');
        if (source.IndexOf('\u2018') > -1) source = source.Replace('\u2018', '\'');
        if (source.IndexOf('\u2019') > -1) source = source.Replace('\u2019', '\'');
        if (source.IndexOf('\u201a') > -1) source = source.Replace('\u201a', ',');
        if (source.IndexOf('\u201b') > -1) source = source.Replace('\u201b', '\'');
        if (source.IndexOf('\u201c') > -1) source = source.Replace('\u201c', '\"');
        if (source.IndexOf('\u201d') > -1) source = source.Replace('\u201d', '\"');
        if (source.IndexOf('\u201e') > -1) source = source.Replace('\u201e', '\"');
        if (source.IndexOf('\u2026') > -1) source = source.Replace("\u2026", "...");
        if (source.IndexOf('\u2032') > -1) source = source.Replace('\u2032', '\'');
        if (source.IndexOf('\u2033') > -1) source = source.Replace('\u2033', '\"');

        byte[] sourceBytes = Encoding.Unicode.GetBytes(source);
        byte[] targetBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, sourceBytes);
        char[] asciiChars = new char[Encoding.ASCII.GetCharCount(targetBytes, 0, targetBytes.Length)];
        Encoding.ASCII.GetChars(targetBytes, 0, targetBytes.Length, asciiChars, 0);
        string result = new string(asciiChars);
        return result;
    }

In addition, as I said, there are a few more of these "invisible" characters, which seem to appear where the Word doc has numbered-list indentation, and I have no idea what Unicode values to use to replace them... so if you have any advice, please let me know.

Thank you very much in advance!

+4
2 answers

According to the StreamReader documentation on MSDN:

The StreamReader object attempts to detect the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used.

Therefore, if the charset of your uploaded file really is windows-1252, then your line:

 sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.Unicode); 

is incorrect, because the contents of the file are not UTF-16 encoded. Use this instead:

 sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.GetEncoding("Windows-1252"), true); 

where the final boolean parameter tells the StreamReader to look for a byte order mark (BOM).
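
To make the effect of that parameter concrete, here is a minimal sketch of my own (not part of the original answer), reusing the control names from the question:

    // Inside btnUpload_Click, using the question's uplSOWDoc / sowLiteral controls
    // (requires System.IO and System.Text).
    // If the stream begins with a byte order mark (EF BB BF for UTF-8,
    // FF FE for UTF-16LE), StreamReader uses the encoding the BOM indicates;
    // otherwise it falls back to the Windows-1252 encoding supplied here.
    using (var reader = new StreamReader(uplSOWDoc.FileContent,
                                         Encoding.GetEncoding("Windows-1252"),
                                         true)) // true = detectEncodingFromByteOrderMarks
    {
        string sowText = reader.ReadToEnd();
        // After the first read, reader.CurrentEncoding reports the encoding
        // that was actually used to decode the stream.
        sowLiteral.Text = sowText;
    }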

+5
 sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.Unicode); 

Congratulations, you are the one millionth coder to get bitten by "Encoding.Unicode".

There is no such thing as Unicode encoding. Unicode is a character set that has many different encodings.

Encoding.Unicode is actually the specific encoding UTF-16LE, in which characters are encoded as UTF-16 "code units" and each 16-bit code unit is written out as bytes in little-endian order. This is the native in-memory Unicode string format of Windows NT, but you almost never want to use it for reading or writing files. At two bytes per code unit, it is not ASCII-compatible and not very efficient for storage or on the wire.
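
To see what that means in bytes, here is a small standalone sketch (my illustration, not the answerer's):

    using System;
    using System.Text;

    class Utf16Demo
    {
        static void Main()
        {
            // UTF-16LE ("Encoding.Unicode"): each character of "Hi" becomes a
            // 16-bit code unit written low byte first -> 48-00-69-00
            Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes("Hi")));

            // UTF-8 keeps ASCII characters as single bytes -> 48-69
            Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("Hi")));
        }
    }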

Today, UTF-8 is a much more common encoding for Unicode text. But Microsoft's mislabelling of UTF-16LE as "Unicode" continues to confuse and mislead users who simply want to "support Unicode". Because Encoding.Unicode is an ASCII-incompatible encoding, trying to read files stored in an ASCII-superset encoding (such as UTF-8 or a default Windows code page like 1252, Western European) produces a huge illegible mess of everything, not just of the non-ASCII characters.
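
As a quick illustration of that last point (again my own sketch, not from the answer), decoding Windows-1252 bytes as if they were UTF-16LE fuses each pair of bytes into one unrelated code unit:

    using System;
    using System.Text;

    class MojibakeDemo
    {
        static void Main()
        {
            // The bytes of text stored in Windows code page 1252...
            byte[] cp1252Bytes = Encoding.GetEncoding(1252).GetBytes("Hello, \u201Cworld\u201D");

            // ...misread as UTF-16LE: every pair of bytes becomes a single
            // 16-bit code unit, producing characters unrelated to the original.
            string garbled = Encoding.Unicode.GetString(cp1252Bytes);
            Console.WriteLine(garbled);
        }
    }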

In this case, your file is stored in Windows code page 1252, so read it with:

 sreader= new StreamReader(uplSOWDoc.FileContent, Encoding.GetEncoding(1252)); 

I would leave it at that. Do not try to "convert to ASCII". Those smart quotes are perfectly good characters and should be supported like any other Unicode character; if you have trouble displaying smart quotes, you are probably mangling all other non-ASCII characters too. It is better to fix the problem that causes that than to try to work around it for only a few common cases.
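
Putting the advice from both answers together, the question's upload handler might end up looking roughly like the sketch below. This is only an illustration: it keeps the question's control names and assumes the page is served as UTF-8 (the ASP.NET default), so the smart quotes can be displayed as-is with no ConvertToASCII step.

    protected void btnUpload_Click(object sender, EventArgs e)
    {
        if (!uplSOWDoc.HasFile)
            return;

        try
        {
            if (uplSOWDoc.PostedFile.ContentType == "text/html" ||
                uplSOWDoc.PostedFile.ContentType == "text/plain")
            {
                // Decode the Word-generated file as Windows-1252; the resulting
                // .NET string holds the smart quotes as proper Unicode characters.
                using (var reader = new StreamReader(uplSOWDoc.FileContent,
                                                     Encoding.GetEncoding(1252)))
                {
                    // No ASCII conversion: the UTF-8 page output can represent
                    // every character in the string.
                    sowLiteral.Text = reader.ReadToEnd();
                }
                lblUploadResults.Text = "File loaded successfully.";
            }
            else
            {
                lblUploadResults.Text = "Upload failed. Just text or html files are allowed.";
            }
        }
        catch (Exception ex)
        {
            lblUploadResults.Text = ex.Message;
        }
    }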

+5

Source: https://habr.com/ru/post/1343855/

