My code is as follows:
for file in glob.iglob(os.path.join(dir, '*.txt')):
    print(file)
    with codecs.open(file, encoding='latin-1') as f:
        infile = f.read()
    with codecs.open('test.txt', mode='w', encoding='utf-8') as f:
        f.write(infile)
The files I'm working with are encoded in Latin-1 (I couldn't open them as UTF-8, obviously), but I want to write the resulting files in UTF-8.
However, I expect the output to look like this:
<Trans audio_filename="VALE_M11_070.MP3" xml:lang="español">
<Datos clave_texto=" VALE_M11_070" tipo_texto="entrevista_semidirigida">
<Corpus corpus="PRESEEA" subcorpus="ESESUMA" ciudad="Valencia" pais="España"/>
Instead it becomes (in gedit):
<Trans audio_filename="VALE_M11_070.MP3" xml:lang="espa뇃漀氀∀㸀ഀ㰀䐀愀琀`漀猀 挀氀愀瘀攀开琀攀砀琀漀㴀∀ 嘀䄀䰀䔀开䴀开 㜀
If I print it on the terminal, it displays fine.
Even more confusing is what I get when I open the resulting file with LibreOffice Writer:
<#T#r#a#n#s# (and so on)
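Symptoms like these (text that breaks right after the first non-ASCII character in gedit, and a placeholder between every letter in LibreOffice) are what one would expect if the source files were actually UTF-16 rather than Latin-1, since decoding as Latin-1 never raises an error and simply turns every second byte into a NUL character. That is an assumption about these files, not something confirmed here. A minimal sketch for checking the raw bytes, where `guess_encoding` is a hypothetical helper and the 30% threshold is an arbitrary choice:

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Rough guess based on a byte-order mark or the pattern of NUL bytes."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Without a BOM, mostly-ASCII UTF-16 text has a NUL in every second byte.
    half = len(raw) // 2
    odd_nuls = raw[1::2].count(b"\x00")   # high bytes in little-endian order
    even_nuls = raw[0::2].count(b"\x00")  # high bytes in big-endian order
    if odd_nuls > half * 0.3:
        return "utf-16-le"
    if even_nuls > half * 0.3:
        return "utf-16-be"
    return "unknown"


# Reproduce the suspected problem in memory: UTF-16 bytes read as Latin-1.
sample = '<Trans xml:lang="español">'
raw = sample.encode("utf-16-le")
garbled = raw.decode("latin-1")        # succeeds, but inserts NUL characters
recovered = raw.decode("utf-16-le")    # decoding with the right codec works

print(guess_encoding(raw))
print(repr(garbled[:6]))
print(recovered == sample)
```

To apply this to a real file, read it in binary mode (`open(path, 'rb').read()`) and pass the bytes to the helper. If the guess is UTF-16, opening the input with `encoding='utf-16'` instead of `'latin-1'` would be the thing to try.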
So, how do I correctly convert a Latin-1 string to a UTF-8 string? In Python 2 this is easy, but in Python 3 it seems confusing to me.
I have already tried variations of the following, in different combinations:
#infile = bytes(infile,'utf-8').decode('utf-8')
But for some reason I always end up with the same strange output.
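For context on why the attempt above changes nothing: in Python 3 a `str` holds Unicode code points rather than bytes, so `bytes(infile, 'utf-8').decode('utf-8')` encodes and immediately decodes again, giving back an equal string. The encoding only matters at the point where text becomes bytes. A small sketch, using a stand-in value in place of the real `infile`:

```python
# Stand-in for the decoded file contents (an assumption for illustration).
text = "español"

# Encoding to UTF-8 and decoding straight back is a no-op on a str.
round_tripped = bytes(text, "utf-8").decode("utf-8")
print(round_tripped == text)

# The same str produces different bytes depending on the target encoding.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
print(utf8_bytes)
print(latin1_bytes)
```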
Thanks in advance!
Edit: This question is different from the questions linked in the comments, as it concerns Python 3, not Python 2.7.