How to read a "C source, ISO-8859 text" file

I have this file, myfile (which I pasted; I hope the problematic data survived the copy/paste). I am trying to read it with:

 import codecs
 codecs.open('myfile', 'r', 'utf-8').read()

But it gives:

 UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 7128: invalid continuation byte 

If I check the file:

 » file myfile
 myfile: C source, ISO-8859 text

  • How can I read this file (ISO-8859) in Python?
  • In general, how can I find out how a file is encoded?

I often deal with files that I did not generate myself (system files, random files downloaded from the Internet, random files provided by suppliers, customers, ...), and these files give no indication of which encoding they use. Working in a multicultural environment (Europe), it is difficult to know how these files were encoded. In most cases, even the person providing the files has no idea about the encoding, because it is chosen behind the scenes by their editor or tool. How can I be sure which encoding a file uses?

+6
2 answers

Change the codec in the open() call. ISO-8859 is a family of standards with several codecs; I picked Latin-1 (iso-8859-1) here, but you may need a different one:

 codecs.open('myfile', 'r', 'iso-8859-1').read() 

See the list of standard encodings in the Python codecs documentation for valid codec names. Judging by the pasted data, iso-8859-1 is the right codec to use, as it is suited to Scandinavian text.
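To see why the UTF-8 read fails on byte 0xe5 while Latin-1 succeeds, here is a small sketch. The sample bytes are hypothetical and not taken from the asker's file; they only illustrate the kind of Scandinavian text that would trigger the error.

```python
# Hypothetical sample: the word "blåbær" encoded as ISO-8859-1 (Latin-1).
raw = b"bl\xe5b\xe6r"

# In UTF-8, 0xE5 opens a three-byte sequence and must be followed by
# continuation bytes. Here it is followed by a plain ASCII 'b', so decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# In ISO-8859-1 every byte maps to exactly one character: 0xE5 is 'å', 0xE6 is 'æ'.
latin1_text = raw.decode("iso-8859-1")

print(utf8_ok, latin1_text)
```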

As a rule, without other sources of information you cannot know which codec a file uses. At best you can guess, which is what the file utility does.
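A simple way to guess is to try a few candidate encodings in order and keep the first one that decodes without error. This is only a sketch: the helper name guess_decode and the candidate list are assumptions for illustration, and a successful decode does not prove the guess is right. In particular, iso-8859-1 accepts every possible byte, so it never fails and should come last.

```python
def guess_decode(data, candidates=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the order of candidates matters, because
    permissive codecs such as iso-8859-1 accept any byte sequence.
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Usage with a file: read the raw bytes first, then guess.
# with open('myfile', 'rb') as f:
#     text, encoding = guess_decode(f.read())
```

Strict codecs like UTF-8 go first because they reject most text written in other encodings, which makes a successful UTF-8 decode a fairly strong signal.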

+9

With Python 3 (3.3, for example) you can use the built-in open function, which accepts an encoding argument:

 open("myfile", encoding="ISO-8859-1")
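A slightly fuller sketch of the same idea, using a with block so the file is closed automatically. The temporary file and its content are stand-ins for the asker's myfile, created here only so the example runs on its own.

```python
import os
import tempfile

# Create a small Latin-1 encoded file as a stand-in for 'myfile'
# (hypothetical content, written as raw bytes).
with tempfile.NamedTemporaryFile(delete=False, suffix=".c") as tmp:
    tmp.write("/* bl\u00e5b\u00e6r */\n".encode("iso-8859-1"))
    path = tmp.name

# Read it back as text, telling open() which encoding to use.
with open(path, encoding="iso-8859-1") as f:
    text = f.read()

os.remove(path)
print(text)
```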
+8

Source: https://habr.com/ru/post/946342/

