How to read a "C source, ISO-8859 text" file

Question

I have this file, myfile (which I pasted; I hope the bytes that cause the problem survived the copy / paste). I am trying to read it with:

 import codecs
 codecs.open('myfile', 'r', 'utf-8').read()

But it gives:

 UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 7128: invalid continuation byte

If I check the file:

 » file myfile
 myfile: C source, ISO-8859 text

How can I read this file (ISO-8859) in Python?
In general, how can I find out how a file is encoded?

I often deal with files that were not generated by me (system files, random files downloaded from the Internet, random files provided by suppliers, customers, ...), and these files do not state which encoding they use. Working in a multicultural environment (Europe), it is difficult to know how these files were encoded. In most cases, not even the person providing the files has a clue about the encoding, which may be chosen behind the scenes by their editor or tool of choice. How can I be sure which encoding a file actually uses?

+6

python unicode

dangonfast Jun 2 '13 at 13:59

2 answers

With Python 3.3 you can use the built-in open() function, which accepts an encoding argument:

 open("myfile",encoding="ISO-8859-1")
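As a minimal sketch of the same call in context, the snippet below first writes a small ISO-8859-1 file and then reads it back through open() with an explicit encoding. The temporary file and the sample word are assumptions made only so the example runs on its own; with a real file you would pass its path directly.

```python
import os
import tempfile

# Create a small ISO-8859-1 file so the example is self-contained.
# (Hypothetical sample data; replace with your own file path.)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("sm\u00f8rbr\u00f8d\n".encode("iso-8859-1"))
    path = tmp.name

try:
    # The with-statement closes the file even if read() raises.
    with open(path, encoding="iso-8859-1") as handle:
        content = handle.read()
finally:
    os.remove(path)

print(content)
```

Opening the file in a with-block avoids leaking the file handle, which the one-line form leaves to the garbage collector.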

+8

David Michael Gang Apr 27 '14 at 8:54

Martijn Pieters · Accepted Answer · 2013-06-02T14:00:30+0000

Change the codec in the open() call. The ISO-8859 standard defines several codecs; I picked Latin-1 (ISO-8859-1) for you, but you may need to choose a different one:

 codecs.open('myfile', 'r', 'iso-8859-1').read()

See the list of standard encodings in the codecs module documentation for valid codec names. Judging by the paste, iso-8859-1 is the right codec to use, as it is suitable for Scandinavian text.
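To illustrate why the choice of ISO-8859 variant matters, here is a small sketch that decodes the byte from the error message (0xe5) with a few ISO-8859 codecs. The particular list of variants is an arbitrary selection for illustration, not a recommendation.

```python
# The byte that UTF-8 rejected in the question's traceback.
problem_byte = b"\xe5"

# A few ISO-8859 variants, chosen arbitrarily for comparison.
variants = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-15"]

# Every single-byte codec here maps 0xe5 to some character,
# but not necessarily to the same one.
decoded = {name: problem_byte.decode(name) for name in variants}

for name, char in decoded.items():
    print(name, repr(char))
```

Under iso-8859-1 the byte becomes "å", which fits Scandinavian text; other variants silently produce a different letter instead of raising an error, so a wrong guess is not detected by the decoder.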

As a rule, without other sources, you cannot know which codec the file uses. At best, you can guess (which is what file does).
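One common way to "guess" in practice is to try a short list of candidate encodings in order and keep the first that decodes without error. The sketch below assumes a hypothetical helper name, decode_with_fallback, and a candidate list of UTF-8 followed by ISO-8859-1; neither comes from the answer above.

```python
def decode_with_fallback(data, candidates=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes data.

    Hypothetical helper: the candidate order is an assumption, not a rule.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Sample bytes: a Scandinavian word encoded as ISO-8859-1 (not valid UTF-8).
sample = b"bl\xe5b\xe6r"
text, used = decode_with_fallback(sample)
print(used, text)
```

Note that iso-8859-1 accepts every possible byte, so it should come last in the list: it never fails, which also means a successful decode does not prove the guess was right.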
