Writing and then reading a line in a file encoded in latin1

Question

Here are two Python 3 code examples. The first writes two files with Latin-1 encoding:

 s = 'On écrit ça dans un fichier.'
 with open('spam1.txt', 'w', encoding='ISO-8859-1') as f:
     print(s, file=f)
 with open('spam2.txt', 'w', encoding='ISO-8859-1') as f:
     f.write(s)

The second reads the same files with the same encoding:

 with open('spam1.txt', 'r', encoding='ISO-8859-1') as f:
     s1 = f.read()
 with open('spam2.txt', 'r', encoding='ISO-8859-1') as f:
     s2 = f.read()

Now, when I evaluate s1 and s2, I get

 On Ã©crit Ã§a dans un fichier.

instead of the initial "On écrit ça dans un fichier".

What's wrong? I also tried io.open, but I must be missing something. The funny thing is that I did not have this problem with Python 2.7 and its str.decode method, which is now gone ...

Can someone help me?

python io latin1

François Coulombeau Jul 22 '13 at 14:34

1 answer

Martijn Pieters · Accepted Answer · 2013-07-22T14:41:14+0000

Your data has been written out as UTF-8:

 >>> 'On écrit ça dans un fichier.'.encode('utf8').decode('latin1')
 'On Ã©crit Ã§a dans un fichier.'

This means either that you did not write Latin-1 data, or that your source code was saved as UTF-8 but you declared your script (using a PEP 263-compliant header) to be Latin-1 instead.
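As a rough illustration of that diagnosis, the sketch below reproduces the garbled text and then undoes it. It uses escape codes for the accented letters so that the result does not depend on how the snippet itself is saved; the variable names are only for illustration.

```python
# The intended text, written with escapes so the source encoding is irrelevant.
original = 'On \u00e9crit \u00e7a dans un fichier.'

# Encode as UTF-8, then (wrongly) decode those bytes as Latin-1.
garbled = original.encode('utf-8').decode('latin-1')
print(garbled)

# Latin-1 maps every byte to exactly one code point, so the mistake
# can be reversed by re-encoding as Latin-1 and decoding as UTF-8.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == original)
```

Each accented letter becomes two characters because UTF-8 stores it as two bytes, and Latin-1 decodes every byte separately.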

If you saved your Python script with a header such as:

 # -*- coding: latin-1 -*-

but your text editor saved the file encoded as UTF-8 instead, then the string literal:

 s='On écrit ça dans un fichier.'

will be misinterpreted by Python in the same way. Saving the resulting Unicode value to disk as Latin-1 and then reading it back as Latin-1 preserves the error.
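To see why the file round trip cannot fix anything, here is a minimal sketch that simulates the misread literal, writes it as Latin-1 and reads it back. The temporary directory and the file name spam.txt are assumptions made for the example, not part of the original scripts.

```python
import os
import tempfile

# Simulate what Python sees when UTF-8 source bytes are decoded as Latin-1.
intended = 'On \u00e9crit \u00e7a dans un fichier.'
misread = intended.encode('utf-8').decode('latin-1')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'spam.txt')  # hypothetical file name
    with open(path, 'w', encoding='ISO-8859-1') as f:
        f.write(misread)
    with open(path, 'rb') as f:
        raw = f.read()        # the bytes actually stored on disk
    with open(path, 'r', encoding='ISO-8859-1') as f:
        read_back = f.read()

print(raw)                    # these are the UTF-8 bytes of the intended text
print(read_back == misread)   # the round trip is lossless, so the error survives
```

In other words, the file ends up holding UTF-8 bytes even though it was opened with ISO-8859-1, which matches the "written out as UTF-8" observation.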

To debug, take a close look at the output of print(s.encode('unicode_escape')) in the first script. If it looks like this:

 b'On \\xc3\\xa9crit \\xc3\\xa7a dans un fichier.'

then your source code encoding and the PEP 263 header disagree on how the source should be interpreted. If your source code is decoded correctly, the output is:

 b'On \\xe9crit \\xe7a dans un fichier.'
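If you would rather have the check automated, something like the following heuristic might do. The function name looks_like_utf8_read_as_latin1 is hypothetical, and the test is only a heuristic: a string that legitimately contains sequences such as 'Ã©' would be flagged as well.

```python
def looks_like_utf8_read_as_latin1(text):
    """Heuristic: True if text re-encodes to Latin-1 bytes that are valid non-ASCII UTF-8."""
    try:
        decoded = text.encode('latin-1').decode('utf-8')
    except UnicodeError:
        # Not encodable as Latin-1, or the bytes are not valid UTF-8:
        # the text was probably decoded correctly in the first place.
        return False
    # Pure ASCII round-trips unchanged and is not a sign of trouble.
    return decoded != text


good = 'On \u00e9crit \u00e7a dans un fichier.'
bad = good.encode('utf-8').decode('latin-1')

print(looks_like_utf8_read_as_latin1(good))
print(looks_like_utf8_read_as_latin1(bad))
print(good.encode('unicode_escape'))
print(bad.encode('unicode_escape'))
```

The correctly decoded string fails the UTF-8 decode because a lone 0xE9 byte followed by an ASCII letter is not a valid UTF-8 sequence.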

If Spyder stubbornly ignores the PEP 263 header and reads your source as Latin-1 regardless, avoid non-ASCII characters and use escape codes instead, either \uxxxx Unicode escape codes:

 s = 'On \u00e9crit \u00e7a dans un fichier.'

or \xaa single-byte escape codes for code points below 256:

 s = 'On \xe9crit \xe7a dans un fichier.'
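A quick sanity check of the escape codes, as a sketch: é is code point U+00E9 and ç is U+00E7, so both spellings should produce the same string and encode to one byte per character in Latin-1. The variable names here are illustrative only.

```python
import unicodedata

s_unicode = 'On \u00e9crit \u00e7a dans un fichier.'
s_hex = 'On \xe9crit \xe7a dans un fichier.'

# Both escape styles denote the same code points.
print(s_unicode == s_hex)

# Confirm which characters the escapes stand for.
print(unicodedata.name('\u00e9'))
print(unicodedata.name('\u00e7'))

# Encoded as Latin-1, each accented letter is a single byte.
latin1_bytes = s_unicode.encode('latin-1')
print(latin1_bytes)
```

Because the literals contain only ASCII characters, they come out the same whether the editor saves the script as UTF-8 or as Latin-1.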
