I'm using the FileUpload server control to upload an HTML document that was previously saved from MS Word (as Web Page, Filtered). Its charset is windows-1252. The document contains smart (curly) quotes as well as regular quotes. It also has some apparent blank spaces which, on closer inspection, are characters other than the regular TAB or SPACE.
When I read the file content with a StreamReader, these special characters are turned into question marks. I assume this is because the default encoding is UTF-8 and the file is Unicode.
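Here is a minimal repro of the symptom, assuming the bytes really are windows-1252 as the document's charset says. `EncodingDemo` and `Decode` are names I made up for illustration; they are not part of my page code.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingDemo
{
    // Decodes the bytes with the given encoding, the same way a StreamReader would.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // 0x93 and 0x94 are the curly double quotes in windows-1252.
        byte[] bytes = { 0x93, 0x48, 0x69, 0x94 };

        // Assumption: .NET Framework, where code page 1252 is available out of the box.
        // On .NET Core / .NET 5+ it has to be registered first via
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        Encoding cp1252 = Encoding.GetEncoding(1252);

        string good = Decode(bytes, cp1252);       // quotes survive as U+201C / U+201D
        string bad = Decode(bytes, Encoding.UTF8); // invalid UTF-8 bytes become U+FFFD

        Console.WriteLine(((int)good[0]).ToString("X4")); // 201C
        Console.WriteLine(((int)bad[0]).ToString("X4"));  // FFFD
    }
}
```

If this matches what happens in my upload, the replacement characters would be what later shows up as question marks.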
So I created the StreamReader with Unicode encoding and then replaced all the unwanted characters with the correct ones (using code I found on Stack Overflow). This seems to work, except that I can't convert the string back to UTF-8 to display it in an asp:Literal. The code is there and it should work, but the output of ConvertToASCII is unreadable.
Take a look below:
    protected void btnUpload_Click(object sender, EventArgs e)
    {
        StreamReader sreader;
        if (uplSOWDoc.HasFile)
        {
            try
            {
                if (uplSOWDoc.PostedFile.ContentType == "text/html" || uplSOWDoc.PostedFile.ContentType == "text/plain")
                {
                    sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.Unicode);
                    string sowText = sreader.ReadToEnd();
                    sowLiteral.Text = ConvertToASCII(sowText);
                    lblUploadResults.Text = "File loaded successfully.";
                }
                else
                    lblUploadResults.Text = "Upload failed. Just text or html files are allowed.";
            }
            catch (Exception ex)
            {
                lblUploadResults.Text = ex.Message;
            }
        }
    }

    private string ConvertToASCII(string source)
    {
        if (source.IndexOf('\u2013') > -1) source = source.Replace('\u2013', '-');
        if (source.IndexOf('\u2014') > -1) source = source.Replace('\u2014', '-');
        if (source.IndexOf('\u2015') > -1) source = source.Replace('\u2015', '-');
        if (source.IndexOf('\u2017') > -1) source = source.Replace('\u2017', '_');
        if (source.IndexOf('\u2018') > -1) source = source.Replace('\u2018', '\'');
        if (source.IndexOf('\u2019') > -1) source = source.Replace('\u2019', '\'');
        if (source.IndexOf('\u201a') > -1) source = source.Replace('\u201a', ',');
        if (source.IndexOf('\u201b') > -1) source = source.Replace('\u201b', '\'');
        if (source.IndexOf('\u201c') > -1) source = source.Replace('\u201c', '\"');
        if (source.IndexOf('\u201d') > -1) source = source.Replace('\u201d', '\"');
        if (source.IndexOf('\u201e') > -1) source = source.Replace('\u201e', '\"');
        if (source.IndexOf('\u2026') > -1) source = source.Replace("\u2026", "...");
        if (source.IndexOf('\u2032') > -1) source = source.Replace('\u2032', '\'');
        if (source.IndexOf('\u2033') > -1) source = source.Replace('\u2033', '\"');

        byte[] sourceBytes = Encoding.Unicode.GetBytes(source);
        byte[] targetBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, sourceBytes);
        char[] asciiChars = new char[Encoding.ASCII.GetCharCount(targetBytes, 0, targetBytes.Length)];
        Encoding.ASCII.GetChars(targetBytes, 0, targetBytes.Length, asciiChars, 0);

        string result = new string(asciiChars);
        return result;
    }
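As far as I can tell, the byte-array tail of ConvertToASCII boils down to an ASCII round trip. Here is a small sketch of that step on its own, so it can be checked separately from the upload; `AsciiRoundTrip` and `ToAscii` are hypothetical names, not part of my page.

```csharp
using System;
using System.Text;

static class AsciiRoundTrip
{
    // Encodes the string to ASCII bytes and decodes it again.
    // Any character ASCII cannot represent is replaced with '?'.
    public static string ToAscii(string source)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(source);
        return Encoding.ASCII.GetString(asciiBytes);
    }

    public static void Main()
    {
        // U+00A0 (non-breaking space) has no ASCII equivalent.
        Console.WriteLine(ToAscii("a\u00a0b")); // a?b
    }
}
```

So any character that my Replace calls don't cover would come out of this step as a question mark.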
Also, as I mentioned, there are some more "transparent" characters that seem to appear where the Word document has numbered-list indentation. I have no idea how to find their Unicode values so that I can replace them, so if you have any tips, please let me know.
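One thing I could try for identifying them is to list every non-ASCII character with its code point. The following is only a sketch of that idea; `CharInspector` and `DumpNonAscii` are hypothetical helpers, and the sample string is made up.

```csharp
using System;
using System.Collections.Generic;

static class CharInspector
{
    // Returns "U+XXXX" for every UTF-16 code unit above the ASCII range.
    // Characters outside the BMP would show up as two surrogate entries.
    public static List<string> DumpNonAscii(string text)
    {
        var found = new List<string>();
        foreach (char c in text)
        {
            if (c > 127)
                found.Add("U+" + ((int)c).ToString("X4"));
        }
        return found;
    }

    public static void Main()
    {
        // Made-up sample: a non-breaking space and a right single quote.
        string sample = "a\u00a0b\u2019c";
        foreach (string codePoint in DumpNonAscii(sample))
            Console.WriteLine(codePoint); // U+00A0, then U+2019
    }
}
```

Running this over the decoded file text should show which code points sit in the indentation spots, assuming the text was decoded with the right encoding in the first place.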
Thank you very much in advance!