A file is just a sequence of bytes, and without additional information you cannot tell whether those bytes represent characters in some text encoding (say, ASCII, UTF-8, or one of the ANSI code pages) or something else entirely. You have to resort to heuristics, for example:
- Try to decode the file with several well-known encodings and see whether decoding succeeds. If it does, you most likely have a text file.
- If you expect text files in Western languages only, you can assume that most characters fall in the ASCII range (0..127), more specifically the printable range (33..126) plus whitespace (tab, newline, carriage return, space). Count how often each byte value occurs, and if the vast majority of the file falls within this set of "typical Western characters", it is usually safe to read it as a text file.
- An extension of the previous approach: sample enough text in the languages you expect and build a character frequency profile. To test a file, compare its character frequency profile with the one from your sample data and see whether it is close enough.
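As a rough illustration, the three heuristics might be sketched in Python like this. The function names, the list of encodings, and the 0.95 threshold are my own assumptions, not fixed rules; tune them for your data.

```python
import math
from collections import Counter

# Printable ASCII (33..126) plus tab, newline, carriage return, space.
TEXT_BYTES = frozenset(range(33, 127)) | {9, 10, 13, 32}


def decodes_as(data: bytes, encodings=("ascii", "utf-8")) -> bool:
    """Heuristic 1: does the data decode cleanly in any candidate encoding?"""
    for enc in encodings:
        try:
            data.decode(enc)
            return True
        except UnicodeDecodeError:
            continue
    return False


def text_ratio(data: bytes) -> float:
    """Heuristic 2: fraction of bytes that are typical Western text bytes."""
    if not data:
        return 1.0  # assumption: treat an empty file as text
    return sum(b in TEXT_BYTES for b in data) / len(data)


def byte_profile(data: bytes) -> list:
    """Heuristic 3: relative frequency of each of the 256 byte values."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]


def profile_similarity(a: list, b: list) -> float:
    """Cosine similarity of two profiles (1.0 means the same distribution)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def looks_like_text(data: bytes, threshold: float = 0.95) -> bool:
    """Combine heuristics 1 and 2; the threshold is an arbitrary guess."""
    return decodes_as(data) and text_ratio(data) >= threshold
```

Note that a successful decode alone proves little: control bytes such as `\x00` are valid ASCII, and single-byte encodings like Latin-1 accept any byte sequence at all, which is why the sketch combines the decode check with the byte-ratio check.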
There is another option, though: treat everything you receive as text and apply whatever transformation the destination requires (for example, HTML-encoding it before sending it to a web browser). As long as you prevent the file from being interpreted as binary data (for example, by a user double-clicking it), the worst you will produce is gibberish.
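A minimal sketch of that "treat it all as text" approach, assuming the destination is a web page. `render_as_text` is a hypothetical helper name, and decoding as UTF-8 with replacement characters is just one possible choice:

```python
import html


def render_as_text(data: bytes) -> str:
    """Show arbitrary bytes as harmless text inside an HTML page."""
    # Undecodable bytes become U+FFFD instead of raising an error.
    text = data.decode("utf-8", errors="replace")
    # Escape <, >, & and quotes so nothing is interpreted as markup.
    return "<pre>" + html.escape(text) + "</pre>"
```

A binary file passed through this comes out as gibberish in the browser, but never as executable markup.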