I'm sharing my solution in the hope that it helps others, just as posts on forums like this one have helped me.
Background
I searched long and hard for a solution to this. I expected it to be simple, or at most slightly tricky.
However, most of the answers here and in other sources are confusing and dive into Unicode, the UTF family, specifications, encodings, and byte orders. Along the way I also went off-road into ASCII tables and code pages.
In any case, I came up with a solution based on reading the file through a stream and checking for a custom set of control characters.
It takes into account various tips and advice from this forum and elsewhere, such as:
- Check for multiple control characters, for example by searching for several consecutive null characters.
- Check UTF, Unicode, encodings, specifications, byte orders, and similar aspects.
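The first tip above can be sketched roughly as follows. This is only an illustration of the consecutive-null idea, not the approach I ended up using. The names NullRunCheck and HasNullRun, the threshold of 2, and the 8000-byte limit are all hypothetical choices of mine.

```csharp
using System;
using System.IO;

public static class NullRunCheck
{
    // Returns true if the first maxBytes bytes of the stream contain at least
    // `threshold` consecutive NUL (0x00) bytes.
    // threshold and maxBytes are hypothetical defaults, not recommended values.
    public static bool HasNullRun(Stream stream, int threshold = 2, int maxBytes = 8000)
    {
        int run = 0;
        int b;
        for (int i = 0; i < maxBytes && (b = stream.ReadByte()) != -1; i++)
        {
            // Extend the current run on a NUL byte, otherwise reset it.
            run = (b == 0) ? run + 1 : 0;
            if (run >= threshold) return true;
        }
        return false;
    }
}
```

A heuristic like this works on raw bytes, so it may misjudge text stored in encodings that legitimately contain runs of zero bytes (UTF-32, for instance), which is part of why I did not go this way.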
My goal:
- It should not rely on byte orders, encodings, or other more complex, esoteric work.
- It should be relatively easy to implement and easy to understand.
- It should work on all types of files.
The solution presented here works for me on test data that includes mp3, eml, txt, info, flv, mp4, pdf, gif, png, and jpg files. It has produced the expected results so far.
How the solution works
I rely on the StreamReader(string) constructor to do what it does best, which is working out the characteristics of the file's encoding. It uses UTF8Encoding by default.
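To see what StreamReader decides for a given file, something like the sketch below can help. It assumes the documented behavior that the StreamReader(string) constructor starts with UTF-8 and may switch encoding once the first read has looked at a byte order mark. EncodingProbe and DetectedEncodingName are hypothetical names used only for this illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingProbe
{
    // Opens the file with the same constructor the solution uses and reports
    // the encoding StreamReader settles on after its first read.
    public static string DetectedEncodingName(string path)
    {
        using (var reader = new StreamReader(path))
        {
            // CurrentEncoding is only reliable after the first read,
            // because detection is deferred until then.
            reader.Read();
            return reader.CurrentEncoding.WebName;
        }
    }
}
```

The solution itself never needs this value. It only shows that the reader handles the encoding question so that the calling code does not have to.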
I wrote my own control-character check because Char.IsControl does not seem to be useful here. Its documentation says:
Control characters are formatting and other non-printing characters, such as ACK, BEL, CR, FF, LF, and VT. The Unicode standard assigns code points from \U0000 to \U001F, \U007F, and from \U0080 to \U009F to control characters. These values are to be interpreted as control characters unless their use is otherwise defined by an application. This treats LF and CR as control characters, among other things.
This makes it unusable, as text files contain CR and LF at a minimum.
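A small sketch of the problem: counting the characters that Char.IsControl flags in an ordinary two-line string. IsControlDemo and CountControlChars are hypothetical names for this illustration only.

```csharp
using System;

public static class IsControlDemo
{
    // Counts the characters in the text that char.IsControl reports as control characters.
    public static int CountControlChars(string text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (char.IsControl(c)) count++;
        }
        return count;
    }
}
```

For the string "line one\r\nline two" this returns 2, because both CR and LF are flagged. A check based on Char.IsControl alone would therefore report almost every text file as binary.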
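Solution
using System.Collections.Generic;
using System.IO;
using System.Linq;
// Clipboard and TextDataFormat come from System.Windows.Forms (or System.Windows in WPF).

// Runs isBinary on every file under folderPath and copies the results to the clipboard.
static void testBinaryFile(string folderPath)
{
    List<string> output = new List<string>();
    foreach (string filePath in getFiles(folderPath, true))
    {
        output.Add(isBinary(filePath).ToString() + " ---- " + filePath);
    }
    Clipboard.SetText(string.Join("\n", output), TextDataFormat.Text);
}

// Lists the files in a folder, optionally including subfolders.
public static List<string> getFiles(string path, bool recursive = false)
{
    return Directory.Exists(path)
        ? Directory.GetFiles(path, "*.*", recursive ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly).ToList()
        : new List<string>();
}

// Returns true as soon as one of the custom control characters is read.
public static bool isBinary(string path)
{
    // getSize is not shown here; it is assumed to return the file size in bytes,
    // for example new FileInfo(path).Length.
    long length = getSize(path);
    if (length == 0) return false;

    using (StreamReader stream = new StreamReader(path))
    {
        int ch;
        while ((ch = stream.Read()) != -1)
        {
            if (isControlChar(ch))
            {
                return true;
            }
        }
    }
    return false;
}

// Flags characters strictly between NUL and BS, and strictly between CR and SUB.
public static bool isControlChar(int ch)
{
    return (ch > Chars.NUL && ch < Chars.BS)
        || (ch > Chars.CR && ch < Chars.SUB);
}

public static class Chars
{
    public static char NUL = (char)0;  // Null
    // The listing is cut off after NUL. The remaining constants are filled in
    // with the standard ASCII codes for the names used in isControlChar.
    public static char BS = (char)8;   // Backspace
    public static char CR = (char)13;  // Carriage return
    public static char SUB = (char)26; // Substitute
}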
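To make it clearer which characters the check actually flags, here is a small sketch that lists them. It assumes NUL, BS, CR, and SUB hold their standard ASCII values (0, 8, 13, and 26). ControlRangeDemo and FlaggedBelow are hypothetical names for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class ControlRangeDemo
{
    // Assumed standard ASCII values for the named control characters.
    const int NUL = 0, BS = 8, CR = 13, SUB = 26;

    // Same comparison as isControlChar in the solution.
    public static bool IsControlChar(int ch)
    {
        return (ch > NUL && ch < BS) || (ch > CR && ch < SUB);
    }

    // Collects every code point below `limit` that the check flags.
    public static List<int> FlaggedBelow(int limit)
    {
        var flagged = new List<int>();
        for (int ch = 0; ch < limit; ch++)
        {
            if (IsControlChar(ch)) flagged.Add(ch);
        }
        return flagged;
    }
}
```

Under those assumptions, FlaggedBelow(32) contains the code points 1 to 7 and 14 to 25. NUL itself, backspace, tab, LF, VT, FF, CR, and the codes from 26 to 31 are not flagged, so ordinary line endings and tabs in text files pass the check.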
If you try the solution above, let me know if it works for you or not.
Other interesting and related links: