I am writing a bash script to process some files automatically, and one step should use iconv
to re-encode the source files when they are not already in an encoding I want. For this, I use:
enc=$(file -b --mime-encoding "$file")
if [ "$enc" = "iso-8859-1" ] || [ "$enc" = "us-ascii" ]
then
unset enc
fi
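# Convert only when a source encoding is still set; otherwise pass the file through.
# An explicit if/else avoids "A && B || C", which would also run cat if iconv failed.
if [ -n "$enc" ]
then
iconv -f "$enc" -t iso-8859-1 "$file"
else
cat "$file"
fi |
awk '{
# code to process file further
}' > "$newfile"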
The problem is that I have a UTF-8 file, but file falsely recognizes it as ASCII. The first non-ASCII character is character #314206, which is on line #1028.
There seems to be some sample size for file: for example, if I convert the file from fixed-width to character-delimited, the first non-ASCII character becomes char #80872, and file
correctly recognizes the encoding. Therefore, I assume that the sample size is somewhere between these two values.
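For anyone who wants to reproduce the numbers above, here is a minimal sketch of one way to locate the first non-ASCII byte. The function name `first_non_ascii` is made up for illustration. It replaces every byte in the range 0x80-0xFF with `?` and lets `cmp` report where the result first differs from the original. Note that `cmp` counts bytes, which is the same as the character number only as long as everything before that point is ASCII.

```shell
# Hypothetical helper: report the position of the first non-ASCII byte.
# Prints cmp's usual "differ" message (byte/char offset and line number),
# or nothing at all if the file is pure 7-bit ASCII.
first_non_ascii() {
    # LC_ALL=C makes tr work on raw bytes; '[?*]' repeats '?' as needed.
    LC_ALL=C tr '\200-\377' '[?*]' < "$1" | cmp - "$1"
}
```

Running `first_non_ascii "$file"` on the problem file should print the offset and line of the first byte that `file` would need to see.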
(TL;DR)
Can file be made to look at more of the file, or is there another way to do this from bash?
So far I have looked at file -P, man file, and googling.
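One possible way to avoid depending on the sample size at all is to classify the file by reading all of it. The sketch below is only an illustration: `detect_encoding` is a hypothetical helper, and it assumes the inputs are only ever ASCII, UTF-8, or ISO-8859-1. It prints the same lowercase names that `file -b --mime-encoding` uses, so the rest of the script would not need to change.

```shell
# Hypothetical helper, not part of the original script.
# Classifies a file by reading ALL of it, so the result does not depend on
# how many bytes `file` samples.
# Assumption: inputs are only ever ASCII, UTF-8, or ISO-8859-1.
detect_encoding() {
    local f=$1 non_ascii
    # Count the bytes outside the 7-bit ASCII range (octal 200-377).
    non_ascii=$(LC_ALL=C tr -d '\000-\177' < "$f" | wc -c)
    if [ "$((non_ascii))" -eq 0 ]; then
        echo us-ascii
    # iconv exits non-zero on the first byte sequence that is not valid UTF-8.
    elif iconv -f UTF-8 -t UTF-16 < "$f" > /dev/null 2>&1; then
        echo utf-8
    else
        # Not ASCII and not valid UTF-8: fall back to the assumed legacy encoding.
        echo iso-8859-1
    fi
}
```

With this in place, `enc=$(detect_encoding "$file")` could stand in for the `file -b --mime-encoding` call. The trade-off is that every file is read in full, once or twice, before the actual conversion.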