Bash and file sample size

I am writing a bash script to process some files automatically, and one subtask is to use iconv to re-encode the source files if their encoding is not to my liking. For this, I use:

enc=$(file -b --mime-encoding "$file")                   # get the encoding

if [ "$enc" = "iso-8859-1" ] || [ "$enc" = "us-ascii" ]  # no need to encode these
then                                                     
    unset enc
fi

if [[ -n "$enc" ]]                                       # conditional encoding below
then
    iconv -f "$enc" -t iso-8859-1 "$file"
else
    cat "$file"
fi |
    awk '{
        # code to process file further
    }' > "$newfile"

The problem is that I have a UTF-8 file, but file falsely recognizes it as ASCII. The first non-ASCII character is character # 314206, which is on line # 1028. There seems to be some kind of sample size for file: for example, if I convert the file from fixed-width to character-delimited, the first non-ASCII character becomes char # 80872 and file correctly recognizes the encoding. So I assume the sample size lies somewhere between these two values.
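For anyone who wants to reproduce the measurement, here is a minimal sketch of one way to find where the first non-ASCII byte sits, so it can be compared against whatever limit file applies. The function name first_non_ascii_offset is made up for illustration. Everything before the first non-ASCII byte is single-byte ASCII, so the byte offset and the character offset coincide at that point.

```shell
# Hypothetical helper: print the 1-based offset of the first byte >= 0x80.
# Prints nothing if the file is pure ASCII.
first_non_ascii_offset() {
    # od dumps every byte as an unsigned decimal number, tr puts one number
    # per line, and awk counts numbers until it sees one above 127.
    od -An -v -tu1 "$1" |
        tr -s ' ' '\n' |
        awk 'NF { n++; if ($1 > 127) { print n; exit } }'
}
```

Calling first_non_ascii_offset "$file" on a pure ASCII file prints nothing, which also makes it a quick whole-file ASCII check.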

(TL;DR) How can I set the sample size for file, or otherwise detect the character encoding in bash?

file -P does not seem to have a suitable option. I have tried man file, googling, and so on.



By default, file reads at most 1048576 bytes.

As of commit d04de269, which is in file 5.26 (2016-04-16), the -P option accepts a bytes parameter:

-P, --parameter name=value
    Set various parameter limits.
        Name         Default    Explanation
        ...
        bytes        1048576    max number of bytes to read from file
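Since the bytes parameter only exists from 5.26 on, it can help to check the installed version first. A rough sketch, assuming the version string has the form "file-X.Y" (which is what the first line of file --version prints on the systems I have seen). The helper name supports_bytes_param is made up for illustration.

```shell
# Hypothetical helper: succeeds if a "file-X.Y" version string is >= 5.26.
supports_bytes_param() (
    ver=${1#file-}            # strip the "file-" prefix
    major=${ver%%.*}          # digits before the first dot
    minor=${ver#*.}           # everything after the first dot
    minor=${minor%%[!0-9]*}   # drop anything after the minor number
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 26 ]; }
)
```

It would be used as supports_bytes_param "$(file --version | head -n 1)" && echo "-P bytes is available".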

So, if your version of file supports it, you can set bytes to a larger value. For example, 100 MiB:

$ file -P bytes=104857600 file
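To tie this back to the script in the question, here is a sketch that sizes the limit to the file itself and falls back to a plain call when -P bytes is rejected. Both function names are invented for this example, and it assumes that an older file exits with a non-zero status when it does not know the option or the parameter.

```shell
# Hypothetical helper: a byte limit that covers the whole file, but never
# less than the documented default of 1048576.
sample_limit() (
    size=$(( $(wc -c < "$1") ))   # arithmetic expansion strips wc's padding
    if [ "$size" -gt 1048576 ]; then
        echo "$size"
    else
        echo 1048576
    fi
)

# Hypothetical helper: try the raised limit first, then plain file.
detect_encoding() (
    limit=$(sample_limit "$1")
    file -b -P bytes="$limit" --mime-encoding "$1" 2>/dev/null ||
        file -b --mime-encoding "$1"
)
```

In the original script the first line would then become enc=$(detect_encoding "$file"). Reading the whole file costs more I/O on large inputs, so a fixed cap may be a better trade-off there.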

Source: https://habr.com/ru/post/1687233/

