How to determine whether a file is binary or text in C#?

I need to determine, with roughly 80% certainty, whether a file is binary or text. Is there a way to do this, even a fast and dirty/ugly one, in C#?

+48
c# text binary file-io
May 26 '09 at 2:05
11 answers

I would probably look for a run of control characters that would normally be present in a binary file but rarely in a text file. Binary files tend to use 0 often enough that just testing for many 0x00 bytes would probably be sufficient to catch most files. If you care about localization, you would also need to test for multi-byte patterns.

As pointed out, you can always be unlucky and get a binary file that looks like text, or vice versa.
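A rough sketch of that heuristic; the sample size and the 10% threshold below are my own illustrative choices, not from the answer:

    using System;
    using System.IO;
    using System.Linq;

    static bool LooksBinary(string path, int sampleSize = 8192)
    {
        byte[] buffer = new byte[sampleSize];
        int read;
        using (var stream = File.OpenRead(path))
        {
            read = stream.Read(buffer, 0, buffer.Length);
        }

        // Any NUL byte is a strong hint of binary content; text files almost never contain 0x00.
        if (buffer.Take(read).Any(b => b == 0x00))
            return true;

        // Otherwise count control bytes other than tab, LF, and CR.
        int controlCount = buffer.Take(read).Count(b => b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D);

        // Call it binary if more than ~10% of the sampled bytes are control bytes.
        return read > 0 && controlCount > read / 10;
    }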

+18
May 26 '09 at 14:16

There is an approach known as the Markov chain method. Scan several sample files of both kinds and, for each byte value from 0 to 255, collect statistics (essentially the probability distribution) of the next byte value. This gives you a 64 KB (256x256) profile that you can compare files against at runtime (within a percentage threshold).

Presumably, this is how browsers' charset auto-detection works.
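A minimal sketch of this idea, assuming you train one profile on known text files and another on known binary files; all names below are illustrative, not from the answer:

    using System;
    using System.IO;

    static class ByteTransitionProfile
    {
        // Build a 256x256 table of byte-transition probabilities from sample files.
        public static double[,] Build(params string[] sampleFiles)
        {
            var counts = new long[256, 256];
            foreach (var file in sampleFiles)
            {
                byte[] bytes = File.ReadAllBytes(file);
                for (int i = 1; i < bytes.Length; i++)
                    counts[bytes[i - 1], bytes[i]]++;
            }

            var profile = new double[256, 256];
            for (int a = 0; a < 256; a++)
            {
                long rowTotal = 0;
                for (int b = 0; b < 256; b++) rowTotal += counts[a, b];
                for (int b = 0; b < 256; b++)
                    profile[a, b] = rowTotal == 0 ? 0 : (double)counts[a, b] / rowTotal;
            }
            return profile;
        }

        // Average transition probability of a file under a profile;
        // a higher score means the file "looks like" the training set.
        public static double Score(double[,] profile, string file)
        {
            byte[] bytes = File.ReadAllBytes(file);
            if (bytes.Length < 2) return 0;

            double sum = 0;
            for (int i = 1; i < bytes.Length; i++)
                sum += profile[bytes[i - 1], bytes[i]];
            return sum / (bytes.Length - 1);
        }
    }

At runtime you would compute Score against both the text profile and the binary profile and classify the file by whichever score is higher, or apply the percentage threshold mentioned above.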

+28
May 26 '09 at 14:16

Sharing my solution in the hope that it helps others, just as these posts and forums have helped me.

Background

I researched a solution for this at length, expecting it to be simple or only slightly tricky.

However, most of the attempts here and in other sources offer elaborate solutions that dive into Unicode, the UTF series, BOMs, encodings, and byte orders. Along the way, I also took detours into ASCII tables and code pages.

In any case, I came up with a solution based on the idea of reading the file as a stream and checking for custom-defined control characters.

It is built with various hints and tips offered on this forum and elsewhere in mind, such as:

  • Check for several control characters, for example by looking for multiple consecutive null characters.
  • Check UTF, Unicode, encodings, BOMs, byte orders, and similar aspects.

My goals:

  • It should not rely on byte orders, encodings, or other complex, esoteric details.
  • It should be relatively easy to implement and easy to understand.
  • It should work on all types of files.

The presented solution works for me on test data that includes mp3, eml, txt, info, flv, mp4, pdf, gif, png, and jpg files. So far it gives the expected results.

How the solution works

I rely on the StreamReader default constructor to do what it does best with respect to detecting the file's encoding-related characteristics; it uses UTF8Encoding by default.

I wrote my own version of the control-character check, because Char.IsControl does not seem useful here. Its documentation says:

Control characters are formatting and other non-printing characters, such as ACK, BEL, CR, FF, LF, and VT. The Unicode standard assigns the code points from \U0000 to \U001F, \U007F, and from \U0080 to \U009F to control characters. These values are to be interpreted as control characters unless their use is otherwise defined by an application.

In other words, it counts CR and LF, among others, as control characters, which makes it unusable here, since text files contain at least CR and LF.

Solution

    static void testBinaryFile(string folderPath)
    {
        List<string> output = new List<string>();

        foreach (string filePath in getFiles(folderPath, true))
        {
            output.Add(isBinary(filePath).ToString() + " ---- " + filePath);
        }

        Clipboard.SetText(string.Join("\n", output), TextDataFormat.Text);
    }

    public static List<string> getFiles(string path, bool recursive = false)
    {
        return Directory.Exists(path)
            ? Directory.GetFiles(path, "*.*",
                  recursive ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly).ToList()
            : new List<string>();
    }

    public static bool isBinary(string path)
    {
        // getSize is a helper not shown in this answer; assume it returns the file length,
        // e.g. new FileInfo(path).Length.
        long length = getSize(path);
        if (length == 0) return false;

        using (StreamReader stream = new StreamReader(path))
        {
            int ch;
            while ((ch = stream.Read()) != -1)
            {
                if (isControlChar(ch))
                {
                    return true;
                }
            }
        }
        return false;
    }

    public static bool isControlChar(int ch)
    {
        return (ch > Chars.NUL && ch < Chars.BS)
            || (ch > Chars.CR && ch < Chars.SUB);
    }

    public static class Chars
    {
        public static char NUL = (char)0;  // Null char
        public static char BS  = (char)8;  // Back Space
        public static char CR  = (char)13; // Carriage Return
        public static char SUB = (char)26; // Substitute
    }

If you try the solution above, let me know if it works for you or not.


+10
Oct 30 '14 at 12:27

If the real question here is "Can I read and write this file using StreamReader/StreamWriter without changes?", then the answer is:

    /// <summary>
    /// Detect if a file is text and detect the encoding.
    /// </summary>
    /// <param name="encoding">The detected encoding.</param>
    /// <param name="fileName">The file name.</param>
    /// <param name="windowSize">The number of characters to use for testing.</param>
    /// <returns>true if the file is text.</returns>
    public static bool IsText(out Encoding encoding, string fileName, int windowSize)
    {
        using (var fileStream = File.OpenRead(fileName))
        {
            var rawData = new byte[windowSize];
            var text = new char[windowSize];
            var isText = true;

            // Read raw bytes
            var rawLength = fileStream.Read(rawData, 0, rawData.Length);
            fileStream.Seek(0, SeekOrigin.Begin);

            // Detect encoding correctly (from Rick Strahl's blog)
            // http://www.west-wind.com/weblog/posts/2007/Nov/28/Detecting-Text-Encoding-for-StreamReader
            if (rawData[0] == 0xef && rawData[1] == 0xbb && rawData[2] == 0xbf)
            {
                encoding = Encoding.UTF8;
            }
            else if (rawData[0] == 0xfe && rawData[1] == 0xff)
            {
                encoding = Encoding.Unicode;
            }
            else if (rawData[0] == 0 && rawData[1] == 0 && rawData[2] == 0xfe && rawData[3] == 0xff)
            {
                encoding = Encoding.UTF32;
            }
            else if (rawData[0] == 0x2b && rawData[1] == 0x2f && rawData[2] == 0x76)
            {
                encoding = Encoding.UTF7;
            }
            else
            {
                encoding = Encoding.Default;
            }

            // Read text and detect the encoding
            using (var streamReader = new StreamReader(fileStream))
            {
                streamReader.Read(text, 0, text.Length);
            }

            using (var memoryStream = new MemoryStream())
            {
                using (var streamWriter = new StreamWriter(memoryStream, encoding))
                {
                    // Write the text to a buffer
                    streamWriter.Write(text);
                    streamWriter.Flush();

                    // Get the buffer from the memory stream for comparison
                    var memoryBuffer = memoryStream.GetBuffer();

                    // Compare only the bytes that were read
                    for (var i = 0; i < rawLength && isText; i++)
                    {
                        isText = rawData[i] == memoryBuffer[i];
                    }
                }
            }

            return isText;
        }
    }
+8
Jul 07 '11 at 16:32

While not completely reliable, this checks whether the content contains anything binary:

    public bool HasBinaryContent(string content)
    {
        return content.Any(ch => char.IsControl(ch) && ch != '\r' && ch != '\n');
    }

The reasoning: if any control character exists (other than the standard \r and \n), then it is probably not a text file.

+6
May 20 '15 at 15:16

A quick and dirty approach is to use the file extension and look for common text extensions such as .txt; the Path.GetExtension call can be used for this. Anything else would not really qualify as "fast", although it may well be dirty.
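A minimal sketch, with a hypothetical extension list of my own choosing:

    using System;
    using System.IO;
    using System.Linq;

    static bool LooksLikeTextByExtension(string path)
    {
        // Illustrative list; extend it to whatever counts as "text" in your domain.
        string[] textExtensions = { ".txt", ".csv", ".log", ".xml", ".json", ".html" };
        return textExtensions.Contains(Path.GetExtension(path), StringComparer.OrdinalIgnoreCase);
    }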

+2
May 26 '09 at 14:10

In fact, a really dirty way would be to build a regular expression that accepts only standard text, punctuation, symbol, and whitespace characters, load part of the file into a text stream, and run it against the regular expression. Depending on what qualifies as a plain text file in your problem domain, a successful match indicates that the file is not binary.

To account for Unicode, be sure to set the encoding on your stream accordingly.

This is really suboptimal, but you did say quick and dirty.
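A hedged sketch of this idea; the character classes, sample size, and UTF-8 assumption below are my own choices, not from the answer:

    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    static bool LooksLikePlainText(string path, int sampleChars = 4096)
    {
        var buffer = new char[sampleChars];
        int read;
        using (var reader = new StreamReader(path, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
        {
            read = reader.Read(buffer, 0, buffer.Length);
        }

        string sample = new string(buffer, 0, read);

        // Bytes that are not valid UTF-8 decode to U+FFFD, which falls under \p{S},
        // so reject it explicitly before matching letters, digits, punctuation,
        // symbols, spaces, and common line-ending/tab characters.
        return sample.IndexOf('\uFFFD') < 0
            && Regex.IsMatch(sample, @"^[\p{L}\p{N}\p{P}\p{S}\p{Zs}\r\n\t]*$");
    }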

+2
May 26 '09 at 2:24 pm

Great question! I was surprised that .NET does not provide an easy solution for this.

The following code worked for me to distinguish between images (png, jpg, etc.) and text files.

I just check for consecutive zeros (0x00) in the first 512 bytes, as suggested by Ron Warholick and Adam Briss:

    if (File.Exists(path))
    {
        // Is it binary? Check for consecutive nulls..
        byte[] content = File.ReadAllBytes(path);

        for (int i = 1; i < 512 && i < content.Length; i++)
        {
            if (content[i] == 0x00 && content[i - 1] == 0x00)
            {
                return Convert.ToBase64String(content);
            }
        }

        // No? Return text
        return File.ReadAllText(path);
    }

Obviously this is a quick and dirty approach, but it can easily be extended by breaking the file into, say, 10 chunks of 512 bytes each and checking several of them for consecutive zeros (personally, I would call it a binary file if 2 or 3 of them match, since consecutive zeros are rare in text files). A sketch of that extension follows below.

This should provide a pretty good solution for you.
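A hedged sketch of that chunked extension; the chunk count and the threshold of 2 are illustrative choices, not from the answer:

    using System;
    using System.IO;

    static bool LooksBinaryByChunks(string path, int chunkCount = 10, int chunkSize = 512, int threshold = 2)
    {
        byte[] content = File.ReadAllBytes(path);
        int chunksWithNulls = 0;

        for (int c = 0; c < chunkCount; c++)
        {
            int start = c * chunkSize;
            int end = Math.Min(start + chunkSize, content.Length);

            // Count this chunk if it contains at least one pair of consecutive null bytes.
            for (int i = start + 1; i < end; i++)
            {
                if (content[i] == 0x00 && content[i - 1] == 0x00)
                {
                    chunksWithNulls++;
                    break;
                }
            }

            if (end >= content.Length) break;
        }

        return chunksWithNulls >= threshold;
    }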

+2
Dec 31 '15 at 1:23

http://codesnipers.com/?q=node/68 describes how to tell UTF-16 from UTF-8 using the byte order mark (which may appear in your file). It also suggests scanning a few bytes to see whether they match the UTF-8 multi-byte sequence patterns (listed below) to determine whether your file is a text file; a minimal sketch of such a check follows the list.

  • 0xxxxxxx: ASCII, < 0x80 (128)
  • 110xxxxx 10xxxxxx: 2-byte sequence, >= 0x80
  • 1110xxxx 10xxxxxx 10xxxxxx: 3-byte sequence, >= 0x800
  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4-byte sequence, >= 0x10000
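A minimal sketch of such a pattern check (my own, not from the linked article; it validates lead and continuation bytes only and does not reject overlong encodings):

    static bool LooksLikeValidUtf8(byte[] bytes)
    {
        int i = 0;
        while (i < bytes.Length)
        {
            byte b = bytes[i];
            int continuationBytes;

            if (b < 0x80) continuationBytes = 0;                  // 0xxxxxxx (ASCII)
            else if ((b & 0xE0) == 0xC0) continuationBytes = 1;   // 110xxxxx
            else if ((b & 0xF0) == 0xE0) continuationBytes = 2;   // 1110xxxx
            else if ((b & 0xF8) == 0xF0) continuationBytes = 3;   // 11110xxx
            else return false;                                    // invalid lead byte

            // Each continuation byte must look like 10xxxxxx.
            for (int j = 1; j <= continuationBytes; j++)
            {
                if (i + j >= bytes.Length || (bytes[i + j] & 0xC0) != 0x80)
                    return false;
            }

            i += continuationBytes + 1;
        }
        return true;
    }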
+1
May 26 '09 at 14:46

How about another way: determine the length of the byte array representing the file's contents and compare it with the length of the string you get after converting that byte array to text.

If the lengths are the same, there are no "unreadable" characters in the file, so it is text (I'm about 80% sure).

+1

Another way is to detect the file's encoding using UDE. If a charset is detected successfully, you can be reasonably sure it is text; otherwise it is binary, since binary data has no text encoding.

Of course, you can use an encoding-detection library other than UDE. If the library is good enough, this approach can come close to 100% correctness.
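Below is a minimal sketch of this approach. It assumes the Ude NuGet package's CharsetDetector API (Feed, DataEnd, and the Charset property); treat the exact calls as an assumption and check them against the library version you use.

    // Sketch only: assumes Ude's CharsetDetector exposes Feed(byte[], int, int),
    // DataEnd(), and a Charset property that is null when nothing was detected.
    using System.IO;
    using Ude;

    static bool LooksLikeTextViaUde(string path)
    {
        byte[] buffer = File.ReadAllBytes(path);

        var detector = new CharsetDetector();
        detector.Feed(buffer, 0, buffer.Length);
        detector.DataEnd();

        // A successfully detected charset suggests text; no detection suggests binary.
        return detector.Charset != null;
    }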

+1
Mar 25 '15 at 7:47


