How to check if a file is plain text?

In my program, the user can load a file containing links (the program is a web browser), but I need to check whether the file the user selects is plain text or something else (only plain text is allowed).

Can this be done? In case it is useful, I use JFileChooser to open the file.

EDIT:

What is expected from the user : a text file containing URLs.

What I want to avoid : the user loads an MP3 file or an MS Word document (for example).

+6
6 answers

A file is just a series of bytes, and without additional information you cannot determine whether these bytes should be code points in some string encoding (say, ASCII or UTF-8 or ANSI something) or something else. You have to resort to heuristics, for example:

  • Try to parse the file in several well-known encodings and see if the parsing is successful. If so, most likely you have a text file.
  • If you expect text files in Western languages only, you can assume that most characters fall in the ASCII range (0..127), more specifically the printable range (33..126) plus whitespace (tab, newline, carriage return, space). Count the occurrences of each byte value; if the vast majority of the bytes belong to this set of "typical Western characters", it is usually safe to treat the file as text.
  • An extension of the previous approach: sample enough text in the languages you expect and build a character frequency profile. To check a file, compare its character frequency profile against your sample data and see whether it is close enough.
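The first heuristic (try a few encodings and see whether decoding succeeds) could be sketched in Java roughly as follows. The class name, method names, and the list of candidate encodings are assumptions made for illustration. Note that a single-byte encoding such as ISO-8859-1 accepts every byte sequence, so it is not a useful candidate here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeCheck {
    // Candidate encodings (an assumption; adjust to the files you expect).
    static final Charset[] CANDIDATES = {
        StandardCharsets.US_ASCII, StandardCharsets.UTF_8
    };

    // Returns true if the bytes decode in the given charset without any
    // malformed or unmappable input.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Heuristic: the data "looks like text" if at least one candidate
    // encoding decodes it without errors.
    static boolean looksLikeText(byte[] data) {
        for (Charset cs : CANDIDATES) {
            if (decodesCleanly(data, cs)) {
                return true;
            }
        }
        return false;
    }
}
```

This is only a heuristic: a binary file that happens to be valid UTF-8 would still pass, so it is probably best combined with one of the other checks.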

But here is another solution: simply treat everything you get as text, applying transformations where necessary (for example, HTML-encoding it when sending it to a web browser). As long as you never let the file be interpreted as binary data (for example, by letting the user double-click it), the worst you can produce is gibberish.
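As an illustration of the "transform where necessary" idea, here is a minimal sketch of HTML-encoding arbitrary text before displaying it. The class and method names are hypothetical, and only the five characters that are commonly escaped in HTML are handled.

```java
class HtmlEscape {
    // Replaces the characters that have special meaning in HTML with
    // entity references; everything else is copied through unchanged.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this in place, even a file that turns out not to be text is shown as harmless gibberish rather than being interpreted as markup.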

+5

Text is also a form of binary data.

I assume that you want to check whether your input contains any non-printable characters (byte values below 32). If you can safely assume that your text is in a single-byte or ASCII-compatible multi-byte encoding, you can simply scan the entire file and abort as soon as you hit a byte in the range [0, 32) (excluding 9, 10, 13 and whatever else you can expect in "text", or in the worst case checking only for null bytes [thanks, tdammers!]). If you might receive text encoded in UTF-16 or UTF-32, you will have to work harder.
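A minimal sketch of that scan in Java might look like this. The class and method names are made up for the example, and the set of allowed control bytes (tab, line feed, carriage return) is the one mentioned above; widen it if your text files legitimately contain others.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class ControlByteScan {
    // Bytes >= 32 are accepted, plus tab (9), line feed (10) and
    // carriage return (13).
    static boolean isAllowed(int b) {
        return b >= 32 || b == 9 || b == 10 || b == 13;
    }

    // Checks the first 'length' bytes of an in-memory buffer.
    static boolean hasNoControlBytes(byte[] data, int length) {
        for (int i = 0; i < length; i++) {
            if (!isAllowed(data[i] & 0xFF)) { // & 0xFF: treat byte as unsigned
                return false;
            }
        }
        return true;
    }

    // Streams the file in chunks and stops at the first forbidden byte.
    static boolean hasNoControlBytes(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                if (!hasNoControlBytes(buf, n)) {
                    return false;
                }
            }
            return true;
        }
    }
}
```

With JFileChooser, the selected file could be passed in as `chooser.getSelectedFile().toPath()`. As noted above, this check rejects UTF-16 and UTF-32 text, because those encodings contain zero bytes for ordinary characters.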

+2

If you do not want to guess from the file extension, you can read the first part of the file. The next problem, however, is the character encoding. Using a BufferedInputStream (call mark() before reading and reset() afterwards), wrap it in an InputStreamReader that uses the "ISO-8859-1" encoding and count the characters for which Character.isLetterOrDigit() or Character.isWhitespace() returns true to get the ratio of typical text content. I think that for a text file this ratio should be above 80%.

You can also try another encoding such as UTF-8, but you may have problems with invalid characters if it is not UTF-8.
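A rough Java sketch of this ratio check follows. The class name, the method names and the idea of passing the sample size as a parameter are assumptions for illustration; the bytes are decoded directly with ISO-8859-1 rather than through an InputStreamReader, which gives the same characters for this encoding.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class TextRatioCheck {
    // Fraction of characters that are letters, digits or whitespace.
    // An empty sample is treated as text (ratio 1.0) by convention.
    static double textRatio(String sample) {
        if (sample.isEmpty()) {
            return 1.0;
        }
        int typical = 0;
        for (int i = 0; i < sample.length(); i++) {
            char c = sample.charAt(i);
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)) {
                typical++;
            }
        }
        return (double) typical / sample.length();
    }

    // Reads up to sampleSize bytes, then rewinds the stream so the
    // caller can read the file from the start afterwards.
    static boolean looksLikeText(BufferedInputStream in, int sampleSize,
                                 double threshold) throws IOException {
        in.mark(sampleSize);
        byte[] buf = new byte[sampleSize];
        int total = 0;
        int n;
        while (total < sampleSize
                && (n = in.read(buf, total, sampleSize - total)) != -1) {
            total += n;
        }
        in.reset();
        String sample = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        return textRatio(sample) >= threshold;
    }
}
```

One caveat for the use case in the question: punctuation does not count as "typical" here, so a file full of URLs (with many ':', '/' and '.' characters) may score below 80%. The threshold, or the set of accepted characters, would probably need tuning for such files.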

+1

You can also check whether the file starts with a byte order mark (BOM), which indicates that it is encoded in a UTF format:

 - UTF-8 => 0xEF, 0xBB, 0xBF
 - UTF-16 BE => 0xFE, 0xFF
 - UTF-16 LE => 0xFF, 0xFE
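The BOM check could be sketched in Java like this; the class and method names are hypothetical, and the input is assumed to be the first few bytes of the file.

```java
class BomSniffer {
    // Returns the encoding name suggested by a leading BOM,
    // or null if the data does not start with one of the three BOMs.
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    // Compares the leading bytes (as unsigned values) with the prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A BOM is only a positive hint: many text files, especially plain ASCII or UTF-8 ones, have no BOM at all, so a null result does not mean the file is binary.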

Rosum

+1

You should create a filter that looks at the file's description and checks that it is text.

0

You can call the shell command file -i ${filename} from Java and check whether the output contains something like charset=binary . If it does, the file is binary; otherwise, it is a text file.

You can experiment with file in the shell on different files to get a feel for it. In Groovy, I would write something like this (where path holds the path to the file):

"file -i ${path}".execute().getText().contains('charset=binary')

In Java, you can also invoke shell commands, for example with ProcessBuilder.
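A Java counterpart of the Groovy one-liner might look like the sketch below. The class and method names are assumptions, and it presumes that a file command accepting -i (as in the answer above) is installed and on the PATH; the meaning of that flag can differ between platforms, so it is worth checking locally.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Path;

class FileCommandCheck {
    // Interprets the output of `file -i`: a binary file is reported
    // with "charset=binary".
    static boolean isBinaryOutput(String fileOutput) {
        return fileOutput.contains("charset=binary");
    }

    // Runs `file -i <path>` and returns its combined output.
    static String runFileCommand(Path path)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("file", "-i", path.toString())
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return out.toString();
    }

    static boolean isTextFile(Path path)
            throws IOException, InterruptedException {
        return !isBinaryOutput(runFileCommand(path));
    }
}
```

Passing the path as a separate argument to ProcessBuilder avoids problems with spaces in file names, which a single command string would have.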

0

Source: https://habr.com/ru/post/891883/

