Script to automatically detect the character encoding of a text file in Python?

I have set up a script that basically does a large-scale find-and-replace on a plain text document.

At the moment it works fine with ASCII-, UTF-8-, and UTF-16-encoded documents (and possibly others, but I have only tested these three), as long as the encoding is specified inside the script (the example below specifies UTF-16).

Is there a way to make the script automatically detect which of these character encodings the input file uses, and then write the output file in that same encoding?

findreplace = [
    ('term1', 'term2'),
]

# Read and decode the input file with a hard-coded encoding (Python 2)
inF = open(infile, 'rb')
s = unicode(inF.read(), 'utf-16')
inF.close()

# Apply each find/replace pair in turn
for couple in findreplace:
    s = s.replace(couple[0], couple[1])

# Encode and write the result, again with the encoding hard-coded
outF = open(outFile, 'wb')
outF.write(s.encode('utf-16'))
outF.close()

Thanks!


J.F. Sebastian: try chardet.

There is no 100% reliable way to detect a file's encoding, but chardet makes a statistical guess and reports a "confidence" value along with it, so you can decide whether the guess is trustworthy enough to use.
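
For illustration, a minimal sketch of how chardet is typically used (assuming the chardet package is installed; the file name and the 0.8 cut-off are placeholders, not part of the original answer):

import chardet

raw = open('some_input.txt', 'rb').read()   # read raw bytes, not text
guess = chardet.detect(raw)                 # e.g. {'encoding': 'UTF-16', 'confidence': 1.0, ...}
print guess['encoding'], guess['confidence']

# Only trust the guess above an arbitrary confidence cut-off; 'encoding' can be None.
if guess['encoding'] and guess['confidence'] > 0.8:
    text = raw.decode(guess['encoding'])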


Some questions:

(1) ASCII is a subset of UTF-8: any file that decodes successfully as ASCII also decodes identically as UTF-8, so you can cross ASCII off your list (see the snippet after this list).

(2) Do the terms in findreplace contain only ASCII characters? If a "find" or "replace" term contains characters outside ASCII, writing the output in the same encoding as the input may become difficult/impossible.

(3) Why do you want to write the output file in the SAME encoding as the input? Why not just always write, say, UTF-8?

(4) Do the UTF-8 input files start with a BOM?

(5) What other encodings, if any, do you expect to have to handle?

(6) Which of the four combinations (UTF-16LE / UTF-16BE) x (BOM / no BOM) do you mean by "UTF-16"? Note that the generic "utf-16" codec relies on a BOM to pick the byte order when decoding, and writes one when encoding (see the snippet after this list).

(7) Note that chardet doesn't detect UTF-16xE without a BOM; chardet has other blind spots as well.
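
To illustrate points (1) and (6), a quick check at the interactive prompt (Python 2 syntax, to match the rest of the code here):

# (1) Bytes that decode as ASCII decode identically as UTF-8,
#     so "works with ASCII" adds nothing once UTF-8 is handled.
assert 'plain text'.decode('ascii') == 'plain text'.decode('utf-8')

# (6) The generic 'utf-16' codec writes a BOM (in native byte order) when
#     encoding; the endian-specific codecs do not.
print repr(u'A'.encode('utf-16'))      # '\xff\xfeA\x00' on a little-endian machine (BOM first)
print repr(u'A'.encode('utf-16-le'))   # 'A\x00' (no BOM)
print repr(u'A'.encode('utf-16-be'))   # '\x00A' (no BOM)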

If you know that the input files are restricted to ASCII, "ANSI" (in the Windows cp125x sense), UTF-8, or UTF-16 with a BOM, you can do something like the following. Assumption: Windows.

# determine "ANSI"
import locale
ansi = locale.getdefaultlocale()[1] # produces 'cp1252' on my Windows box.

f = open("input_file_path", "rb")
data = f.read()
f.close()

if data.startswith("\xEF\xBB\xBF"): # UTF-8 "BOM"
    encodings = ["utf-8-sig"]
elif data.startswith(("\xFF\xFE", "\xFE\xFF")): # UTF-16 BOMs
    encodings = ["utf16"]
else:
    encodings = ["utf8", ansi, "utf-16le"]
# ascii is a subset of both "ANSI" and "UTF-8", so you don't need it.
# ISO-8859-1 aka latin1 defines all 256 bytes as valid codepoints; so it will
# decode ANYTHING; so if you feel that you must include it, put it LAST.
# It is possible that a utf-16le file may be decoded without exception
# by the "ansi" codec, and vice versa.
# Checking that your input text makes sense, always a very good idea, is very 
# important when you are guessing encodings.

for enc in encodings:
    try:
        udata = data.decode(enc)
        break
    except UnicodeDecodeError:
        pass
else:
    raise Exception("unknown encoding")

# udata is your file contents as a unicode object
# When writing the output file, use 'utf-8-sig' as the encoding if you
# want a BOM at the start.
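
Tying this back to the original script, a sketch (not part of the original answer) of how the detected codec could drive both the find/replace step and the output encoding; it reuses udata and enc from above, and findreplace / outFile from the question:

# Apply the question's find/replace pairs to the decoded text.
for old, new in findreplace:
    udata = udata.replace(old, new)

# Write the result back out using the codec that succeeded above. For the
# 'utf-8-sig' and 'utf16' branches this also re-emits a BOM (native byte
# order for UTF-16); for the no-BOM branch nothing extra is written.
outF = open(outFile, 'wb')
outF.write(udata.encode(enc))
outF.close()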

No, there isn't. You have to encode that knowledge inside the file itself, or obtain it from an external source.

There are heuristics that can guess the encoding of a file through statistical analysis of byte frequencies, but I would not rely on them for any mission-critical data.


Source: https://habr.com/ru/post/1749868/

