Python3: convert Latin-1 to UTF-8

Question

My code is as follows:

 for file in glob.iglob(os.path.join(dir, '*.txt')):
     print(file)
     with codecs.open(file, encoding='latin-1') as f:
         infile = f.read()
     with codecs.open('test.txt', mode='w', encoding='utf-8') as f:
         f.write(infile)

The files I'm working with are encoded in Latin-1 (I couldn't open them as UTF-8, obviously), but I want to write the resulting files in UTF-8.

However, this input:

 <Trans audio_filename="VALE_M11_070.MP3" xml:lang="español">
 <Datos clave_texto=" VALE_M11_070" tipo_texto="entrevista_semidirigida">
 <Corpus corpus="PRESEEA" subcorpus="ESESUMA" ciudad="Valencia" pais="España"/>

does not come out intact. Instead it becomes (in gedit):

 <Trans audio_filename="VALE_M11_070.MP3" xml:lang="espa뇃漀氀∀㸀ഀ਀㰀䐀愀琀`漀猀 挀氀愀瘀攀开琀攀砀琀漀㴀∀ 嘀䄀䰀䔀开䴀㄀㄀开 㜀

If I print it on the terminal, it displays fine.

Even more confusing is what I get when I open the resulting file with LibreOffice Writer:

 <#T#r#a#n#s# (and so on)
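Symptoms like these can appear when the bytes on disk are not in the encoding that the reading code or the viewer assumes. A minimal sketch for checking that, assuming it is enough to look for a byte-order mark and for NUL bytes (which normally do not occur in Latin-1 or UTF-8 text); `describe_bytes` and `BOMS` are hypothetical names introduced here, not part of the question:

```python
import codecs

# Byte-order marks, UTF-32 first so it is not mistaken for UTF-16
# (the UTF-32 LE mark starts with the same two bytes as the UTF-16 LE mark).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def describe_bytes(data):
    """Return (codec suggested by a BOM or None, number of NUL bytes)."""
    bom_codec = None
    for bom, name in BOMS:
        if data.startswith(bom):
            bom_codec = name
            break
    return bom_codec, data.count(b"\x00")
```

It could be applied to both an input file and the written file by opening each with `open(path, 'rb')` and passing the first few kilobytes. A BOM or many NUL bytes would suggest that the file is not really Latin-1 or UTF-8; a result of `(None, 0)` does not prove anything either way.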

So how do I correctly convert a Latin-1 string to a UTF-8 string? In Python 2 this is easy, but in Python 3 it seems confusing to me.

I have already tried the following in different combinations:

 #infile = bytes(infile,'utf-8').decode('utf-8')
 #infile = infile.encode('utf-8').decode('utf-8')
 #infile = bytes(infile,'utf-8').decode('utf-8')

But for some reason I always end up with the same strange output.
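One likely reason these attempts change nothing: in Python 3 a `str` is already decoded Unicode text, so encoding it to UTF-8 and decoding it again with the same codec returns the original string. The encoding only matters where text becomes bytes. A small sketch, using an example string that is not taken from the actual files:

```python
text = "español"  # a Python 3 str is Unicode text, not Latin-1 or UTF-8

# Encoding and decoding with the same codec gives back the same string.
assert text.encode("utf-8").decode("utf-8") == text
assert bytes(text, "utf-8").decode("utf-8") == text

# The conversion itself happens on bytes: decode Latin-1, then encode UTF-8.
latin1_bytes = text.encode("latin-1")                         # b'espa\xf1ol'
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")   # b'espa\xc3\xb1ol'
```

When reading and writing files with an explicit `encoding=` argument, as in the code at the top of the question, this decode and encode step is done by the file objects themselves.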

Thanks in advance!

Edit: This question is different from the questions linked in the comments, as it relates to Python 3, not Python 2.7.

python python-3.5 encoding utf-8

IP Nov 09 '16 at 17:28

2 answers

Flummox · Answer 1 · 2017-01-03T11:26:23+0000

I found a partial solution to this. It is not exactly what you want or need, but it may point others in the right direction...

 # First read the file
 txt = open("file_name", "r", encoding="latin-1")  # r = read, w = write & a = append
 items = txt.readlines()
 txt.close()

 # and write the changes to file
 output = open("file_name", "w", encoding="utf-8")
 for string_fin in items:
     if "Ã©" in string_fin:
         string_fin = string_fin.replace("Ã©", "é")
     if "Ã«" in string_fin:
         string_fin = string_fin.replace("Ã«", "ë")
     # this works if not too much needs changing...
     output.write(string_fin)
 output.close()

* note to detect
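If many different characters are affected, listing every replacement by hand gets tedious. Sequences such as "Ã©" are what UTF-8 bytes look like when they are read as Latin-1, so, assuming that is how the text was damaged, the step can be reversed for a whole line at once. A minimal sketch; `repair_mojibake` is a hypothetical helper name, not something from the answer above:

```python
def repair_mojibake(line):
    """Undo 'UTF-8 read as Latin-1' damage; return the line unchanged on failure."""
    try:
        return line.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the line has characters outside Latin-1, or the resulting
        # bytes are not valid UTF-8, so the assumption does not hold here.
        return line
```

In the loop above, `string_fin = repair_mojibake(string_fin)` could stand in for the individual `replace` calls. Note that a line mixing damaged and already correct accented characters would be left unchanged by this sketch.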

Frenz · Answer 2 · 2017-10-17T23:50:21+0000

For Python 3.6:

 your_str = your_str.encode('utf-8').decode('latin-1')
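Before using this, it may help to check what the expression does to a string that is already correct. A small sketch with an example value (not taken from the original files):

```python
original = "España"

# The expression from this answer: encode as UTF-8, then read the bytes as Latin-1.
converted = original.encode("utf-8").decode("latin-1")

print(converted)                       # EspaÃ±a
print(len(original), len(converted))   # 6 7
```

Each non-ASCII character becomes two characters, so whether this is the desired result depends on how `your_str` was obtained in the first place.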
