How to find out the encoding of a file? C#

I have a directory of files, and I need to find out which of them are UTF-8 encoded and which are ANSI encoded, so that I can later convert them to another encoding of my choice. My problem: how can I tell whether a file is UTF-8 or ANSI encoded? Both encodings occur among my files.

+13
c#
Aug 04 '10 at 9:26 a.m.
4 answers

There is no reliable way to do this (the file might just be random binary data). However, the process Windows Notepad uses is described in detail on Michael S. Kaplan's blog:

http://www.siao2.com/2007/04/22/2239345.aspx

  1. Check the first two bytes:
     1. If there is a UTF-16 LE BOM, treat the file (and load it) as a "Unicode" file;
     2. If there is a UTF-16 BE BOM, treat the file (and load it) as a "Unicode (Big Endian)" file;
     3. If the first two bytes look like the start of a UTF-8 BOM, check the next byte, and if it completes a UTF-8 BOM, treat the file (and load it) as a "UTF-8" file;
  2. Check with IsTextUnicode whether that function thinks the file is BOM-less UTF-16 LE; if so, treat it (and load it) as a "Unicode" file;
  3. Check whether the content is valid UTF-8 using the original RFC 2279 definition from 1998; if it is, treat it (and load it) as a "UTF-8" file;
  4. Otherwise, assume an ANSI file using the machine's default system code page.

Note that there are some holes here, such as the fact that step 2 does not do as well with BOM-less UTF-16 BE (there may even be a bug here, I'm not sure; if so, it is a bug in Notepad beyond any bug in IsTextUnicode).
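As a rough illustration of the first step only (the BOM check), here is a minimal sketch. The class name `BomSniffer` and the method `DetectBom` are made up for this example; they are not part of Notepad or of the .NET Framework, and the later steps (IsTextUnicode, UTF-8 validation, the ANSI fallback) are not covered.

```csharp
using System;

public static class BomSniffer
{
    // Returns the label Notepad would show for a file that starts with a
    // byte order mark, or null when the leading bytes contain no BOM.
    // "head" is expected to hold at least the first three bytes of the file.
    public static string DetectBom(byte[] head)
    {
        if (head == null) throw new ArgumentNullException(nameof(head));

        // UTF-16 little endian BOM: FF FE
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return "Unicode";

        // UTF-16 big endian BOM: FE FF
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return "Unicode (Big Endian)";

        // UTF-8 BOM: EF BB BF (the third byte is only checked after the first two match)
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return "UTF-8";

        return null; // no BOM: the remaining steps would have to decide
    }
}
```

A null result does not mean the file is ANSI; it only means the decision falls through to the heuristic steps.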

+12
Aug 04 '10 at 9:32

http://msdn.microsoft.com/en-us/netframework/aa569610.aspx#Question2

There is no great way to detect an arbitrary ANSI code page, although there have been some attempts to do this based on the probability of certain byte sequences in the middle of text. We don't try that in StreamReader. A few file formats, such as XML or HTML, have a way of specifying the character set on the first line of the file, so that Web browsers, databases, and classes such as XmlTextReader can read these files correctly. But many text files don't have this type of information built in.
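What StreamReader does do is look for a byte order mark. A minimal sketch of using that, assuming the caller supplies whatever fallback encoding should apply when no BOM is found (the class name `ReaderSniffer` and the `fallback` parameter are made up for this example):

```csharp
using System.IO;
using System.Text;

public static class ReaderSniffer
{
    // Lets StreamReader inspect the start of the file for a BOM.
    // If none is found, the reader keeps the fallback encoding passed in;
    // it does not guess at ANSI code pages or BOM-less UTF-8.
    public static Encoding Detect(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            // CurrentEncoding is only updated once the reader has touched
            // the stream, so force it to fill its buffer first.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

For example, `ReaderSniffer.Detect(path, Encoding.Default)` would report the system's default encoding for any file without a BOM, which may or may not be what the file really uses.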

+4
Aug 04 '10 at 9:33

Unicode, UTF8, and UnicodeBigEndian are treated as different types. An ANSI file that contains only ASCII bytes is reported as UTF8, since the two are byte-for-byte identical in that range.

    using System;
    using System.IO;
    using System.Text;

    public class EncodingType
    {
        public static Encoding GetType(string fileName)
        {
            using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
            {
                return GetType(fs);
            }
        }

        public static Encoding GetType(FileStream fs)
        {
            // Reads the whole file into memory, so this only suits files under 2 GB.
            byte[] ss;
            using (BinaryReader r = new BinaryReader(fs, Encoding.Default))
            {
                ss = r.ReadBytes((int)fs.Length);
            }

            // Check for a BOM first. The UTF-16 BOMs are two bytes long; the
            // third byte is ordinary content and must not be tested.
            if (ss.Length >= 3 && ss[0] == 0xEF && ss[1] == 0xBB && ss[2] == 0xBF)
                return Encoding.UTF8;
            if (ss.Length >= 2 && ss[0] == 0xFE && ss[1] == 0xFF)
                return Encoding.BigEndianUnicode;
            if (ss.Length >= 2 && ss[0] == 0xFF && ss[1] == 0xFE)
                return Encoding.Unicode;

            // No BOM: accept as UTF-8 if every byte sequence is well formed,
            // otherwise fall back to the system default (ANSI) encoding.
            return IsUTF8Bytes(ss) ? Encoding.UTF8 : Encoding.Default;
        }

        private static bool IsUTF8Bytes(byte[] data)
        {
            int charByteCounter = 1; // bytes still expected in the current character
            for (int i = 0; i < data.Length; i++)
            {
                byte curByte = data[i];
                if (charByteCounter == 1)
                {
                    if (curByte >= 0x80)
                    {
                        // Count the leading 1 bits of the lead byte.
                        while (((curByte <<= 1) & 0x80) != 0)
                        {
                            charByteCounter++;
                        }
                        // 10xxxxxx is not a valid lead byte; more than 6 bytes
                        // exceeds even the old RFC 2279 limit.
                        if (charByteCounter == 1 || charByteCounter > 6)
                        {
                            return false;
                        }
                    }
                }
                else
                {
                    // Continuation bytes must look like 10xxxxxx.
                    if ((curByte & 0xC0) != 0x80)
                    {
                        return false;
                    }
                    charByteCounter--;
                }
            }
            // A sequence cut off at the end of the file is not valid UTF-8.
            return charByteCounter == 1;
        }
    }
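Once an encoding has been settled on for a file, the conversion the question is ultimately after could look roughly like this. It is a minimal sketch: the class name `EncodingConverter` and its parameters are made up for this example, the file is rewritten in place, and it assumes the source encoding was identified correctly (a wrong guess will corrupt non-ASCII characters).

```csharp
using System.IO;
using System.Text;

public static class EncodingConverter
{
    // Re-encodes a text file in place: decode with "source", encode with "target".
    // Note that File.ReadAllText still honours a BOM if the file has one,
    // and that Encoding.UTF8 as a target writes a BOM while
    // new UTF8Encoding(false) does not.
    public static void Convert(string path, Encoding source, Encoding target)
    {
        string text = File.ReadAllText(path, source);
        File.WriteAllText(path, text, target);
    }
}
```

It is probably wise to run this on copies first, since the original bytes are overwritten.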
+1
Aug 4 2018-10-10T00:

See these two articles on CodeProject, which show how to find a file's encoding simply from its contents:

0
Aug 04 '10 at 9:31