UnicodeEncodeError when writing to file

I have a Python script that works fine on my local machine (OS X), but when I copied it to a server (Debian), it stopped working properly. The script reads an XML file and prints its contents in a new format. On my local machine, I can run the script with stdout going to the terminal or redirected to a file (i.e. > myFile.txt), and both work fine.

However, on the server (over ssh), printing to the terminal works fine, but printing to a file (which is what I really need) raises UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). All files are encoded in utf-8, and utf-8 is declared in the magic comments.

If I print the str objects inside a list (a trick I usually use to get a handle on encoding problems), it throws the same error.

If I use print( x.encode('utf-8') ), it prints bytes literals instead of text (e.g. b'1' b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0').

If I run $ export PYTHONIOENCODING=utf-8 in the shell (as suggested in some SO posts), then I get what looks like a binary file: 1 <D0><9A><D0><B0><D0><BC><D0><B0>.
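As a sanity check (a minimal sketch, not part of the script itself), the bytes shown above can be decoded back into text. They are valid UTF-8, which suggests the data is intact and that the <D0><9A> notation is presumably just how the viewer on the server renders non-ASCII bytes. The variable names here are only for illustration:

```python
# Bytes exactly as shown by print(x.encode('utf-8')) above.
raw = b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0'

# Decoding succeeds, so the byte sequence is well-formed UTF-8.
decoded = raw.decode('utf-8')

# Each character is a Cyrillic code point encoded as two bytes.
code_points = [hex(ord(ch)) for ch in decoded]
bytes_per_char = len(raw) // len(decoded)
```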

I checked all the locale variables, and the relevant ones match what I have on my local machine.

I could just process the file locally and upload the result, but I really want to understand what is going on here. Since the Python code runs fine on my local machine, I'm not sure whether it is relevant, but I include it below:

# -*- encoding: utf-8 -*-
import sys, xml.etree.ElementTree as ET

corpus = ET.parse('file.xml')
text = corpus.getroot()
for body in text :
  for sent in body :
    depDOMs = [(0,'') for i in range(len(sent)+1)]
    for word in sent :
      if word.tag == 'LF' :
        pass
      elif 'ID' in word.attrib and 'FEAT' in word.attrib and 'DOM' in word.attrib :
        ID = word.attrib['ID']
        try :
          Form =  word.text.replace(' ','_')
        except AttributeError :
          Form = '_'
        try :
          Lemma =  word.attrib['LEMMA'].replace(' ', '_')
        except KeyError :
          Lemma = '*NULL*'
        CPOS = word.attrib['FEAT'].split()[0]
        POS = word.attrib['FEAT'].replace( ' ' , '_' )
        Feats = '_'
        Head = word.attrib['DOM']
        if Head == '_root' :
          Head = '0'
        try :
          DepRel = word.attrib['LINK']
        except KeyError :
          DepRel = 'ROOT'
        PHead = '_'
        PDepRel = '_'
        try:
          if word.attrib['NODETYPE'] == 'FANTOM' :
            word.attrib['LEMMA'] = '*'+word.attrib['LEMMA']+'*'
        except KeyError :
          pass
        print( ID , Form , Lemma , Feats, CPOS , POS , Head , DepRel , PHead , PDepRel , sep='\t' )
      else :
        print( 'WARNING: what is this?',sent.attrib['ID'],word.attrib)
  print()
+2
2

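On Linux, if the locale is not configured correctly, Python falls back to ASCII.

Check the output of locale. If the locale is broken, you will see something like this: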

$ locale 
locale: Cannot set LC_CTYPE to default locale: No such file or directory 
locale: Cannot set LC_ALL to default locale: No such file or directory 
LANG=en_US.UTF-8 
LANGUAGE= 

To fix it, generate the missing locale:

$ sudo locale-gen "en_US.UTF-8"

(Replace "en_US.UTF-8" with the locale you need.) See also: https://askubuntu.com/questions/162391/how-do-i-fix-my-locale-issue
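If fixing the server locale is not an option, the script can also be made independent of it. Below is a minimal sketch, assuming Python 3. The helper names describe_io and utf8_writer are hypothetical and not part of the original script. The first reports which encodings Python picked up from the environment, and the second wraps a binary stream so that everything written to it is encoded as UTF-8 regardless of the locale:

```python
import io
import locale
import sys


def describe_io():
    """Return the encodings Python derived from the current environment."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(False),
    }


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(
        binary_stream,
        encoding="utf-8",
        newline="\n",         # write "\n" as-is on every platform
        line_buffering=True,  # flush after each line
    )
```

In the script from the question, one would create out = utf8_writer(sys.stdout.buffer) once and pass file=out to every print call. On Python 3.7 and later, sys.stdout.reconfigure(encoding='utf-8') should have the same effect.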

+2

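To find out which characters cause the problem, catch the UnicodeError.

A UnicodeError carries the details of the failure as attributes. For example, err.object[err.start:err.end] is the part of the input that could not be encoded.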
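The idea of inspecting err.object[err.start:err.end] can be sketched as follows. This assumes Python 3. The function name find_unencodable is hypothetical, and 'ascii' is used as the default only because it is the codec named in the error message from the question:

```python
def find_unencodable(text, encoding="ascii"):
    """Return (start, end, offending_slice) for the first span of `text`
    that `encoding` cannot represent, or None if it encodes cleanly."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        # err.object is the original string; start/end delimit the bad span.
        return err.start, err.end, err.object[err.start:err.end]
    return None
```

Calling it on each field before printing (for example on Form and Lemma in the script above) would show exactly which word triggers the error. The "position 0-3" in the traceback corresponds to start 0 and end 4.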


-1

Source: https://habr.com/ru/post/1665408/

