How to check file encoding in Linux? Multilingual Script Processing

My company has PHP scripts containing text in several languages (including French, German, Spanish, Italian and English).

The developers decided to standardize on Latin-1 as the encoding for everyone, so that nobody changes a file's encoding and corrupts the foreign-language text in it. (At first, some developers used HTML entities, but that approach is not preferred.)

I have a few questions for you:

  • How can I check a file's encoding on Linux?
  • If you have experience working with files in different languages, how did you avoid overwriting each other's encoding?

Thanks for any advice in advance.

+4
4 answers

The developers decided to standardize on Latin-1 as the encoding for everyone, so that nobody changes a file's encoding and corrupts the foreign-language text in it.

Latin-1 cannot represent most of the world's languages. A Unicode encoding (usually UTF-8) is preferred.

How can I check a file's encoding on Linux?

Use the file utility. It can only guess, though.
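
To illustrate why the result is only a guess, here is a minimal Python sketch. It is not how the file utility works internally, and the function name guess_encoding and its return labels are made up for this example. It checks for a UTF-8 byte order mark, then for pure ASCII, then tries a strict UTF-8 decode. Anything else can only be reported as some unknown 8-bit encoding.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Return a best-effort label for the encoding of `data`.

    The labels are illustrative only; this is a heuristic, not a proof.
    """
    if not data:
        return "empty"
    # A UTF-8 byte order mark is a strong (but not conclusive) hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Bytes below 0x80 mean the same thing in ASCII, Latin-1 and UTF-8.
    if all(byte < 0x80 for byte in data):
        return "us-ascii"
    # Strict decoding: non-ASCII text in a legacy 8-bit encoding is
    # very unlikely to be valid UTF-8 by accident.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Could be Latin-1, Windows-1252, or any other single-byte charset.
        return "unknown-8bit"
```

Note that the last branch cannot say which 8-bit encoding was used; telling those apart needs knowledge of the language of the text.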

If you have experience working with files in different languages, how did you avoid overwriting each other's encoding?

Sensibly configured editors.

+5

The file command reports information about a file, including its encoding, language, and so on, depending on the type of file.

Use the --mime-encoding option to print only the encoding.

+6

1. I have used iconv to convert back and forth, but since you do not know the encoding, try enca (Extremely Naive Charset Analyser). In general, though, detection is very hard to get right, because it requires knowledge of common words and so on.

2. The only sane approach is to use a broad encoding such as Unicode. You can enforce this by adding a pre-commit hook to your source control system that accepts only well-formed UTF-8 files (for example).
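
The two points above can be sketched in Python, assuming the existing files really are Latin-1. All function names here (is_valid_utf8, latin1_to_utf8, find_non_utf8, main) are hypothetical, and the way the hook receives its list of files depends on the source control system, so that part is left out.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes that are ASSUMED to be Latin-1 as UTF-8.

    Every byte sequence decodes as Latin-1, so this never fails,
    but it yields garbage if the assumption is wrong.
    """
    return data.decode("latin-1").encode("utf-8")


def find_non_utf8(paths):
    """Return the subset of `paths` whose contents are not valid UTF-8."""
    offenders = []
    for path in paths:
        with open(path, "rb") as handle:
            if not is_valid_utf8(handle.read()):
                offenders.append(path)
    return offenders


def main(paths) -> int:
    """Hook entry point: return 0 to accept the commit, 1 to reject it."""
    offenders = find_non_utf8(paths)
    for path in offenders:
        print(f"not valid UTF-8: {path}")
    return 1 if offenders else 0
```

A hook script could end with sys.exit(main(sys.argv[1:])) after importing sys, so that a non-zero exit status blocks the commit. Pure ASCII files pass the check too, since ASCII is a subset of UTF-8.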

+1

There is no reliable way to verify the encoding of a file; the various 8-bit single-byte encodings are practically indistinguishable without examining the content. Using UTF-8 everywhere means that everyone has a single, universally valid encoding to work with.
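
A small Python sketch of that ambiguity: the same three bytes are valid in both Latin-1 and Windows-1251 and simply produce different text, while a strict UTF-8 decode rejects them. The helper name decodings is made up for this example.

```python
def decodings(data: bytes, encodings):
    """Decode `data` under each encoding; map failures to None."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Three bytes that every single-byte charset accepts without complaint.
raw = bytes([0xE9, 0xE8, 0xE0])

table = decodings(raw, ["latin-1", "cp1251", "utf-8"])
# latin-1 yields accented Latin letters, cp1251 yields Cyrillic letters,
# and utf-8 yields None because the byte sequence is malformed.
for name, text in table.items():
    print(name, repr(text))
```

Nothing in the bytes themselves says which of the two single-byte readings is the intended one; only a reader who knows the expected language can tell.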

0

Source: https://habr.com/ru/post/1302751/

