I have a Python script that works fine on my local machine (OS X), but after I copied it to a server (Debian) it no longer works properly. The script reads an XML file and prints its contents in a new format. On my local machine, I can run the script with stdout going to the terminal or redirected to a file (i.e. > myFile.txt), and both work fine.
However, on the server (over ssh), printing to the terminal works fine, but redirecting to a file (which is what I really need) raises UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). All files are encoded in UTF-8, and UTF-8 is declared in the magic comments.
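To narrow this down, here is a minimal sketch of how one could compare what Python picks for stdout in the two situations (terminal versus redirected to a file). The helper name describe_stdout is made up for illustration; it only reads attributes that sys.stdout provides.

```python
import sys


def describe_stdout():
    """Return the encoding and tty status Python chose for stdout."""
    return {
        "encoding": sys.stdout.encoding,
        "isatty": sys.stdout.isatty(),
    }


if __name__ == "__main__":
    # Report on stderr so the result stays visible when stdout is redirected.
    print(describe_stdout(), file=sys.stderr)
```

Running it once as is and once with > myFile.txt on the server should show whether the encoding changes when stdout is not a terminal.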
If I print the str objects inside a list (a trick I usually use to get a handle on encoding problems), it throws the same error.
If I use print( x.encode('utf-8') ), it prints bytes literals instead of text (e.g. b'1' b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0').
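Both symptoms can be reproduced in isolation, without the XML file or any redirection. This is a small sketch, assuming the word in question is the four Cyrillic characters that the bytes above decode to; the variable names are only illustrative.

```python
# The four characters behind b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0',
# written as escapes so the snippet itself is pure ASCII.
word = "\u041a\u0430\u043c\u0430"

# Encoding to ASCII fails with the same message as in the traceback.
try:
    word.encode("ascii")
except UnicodeEncodeError as err:
    message = str(err)

# print() calls str() on its arguments, and str() of a bytes object
# is its repr, which is why the b'...' literal shows up in the output.
as_printed = str(word.encode("utf-8"))

print(message)
print(as_printed)
```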
If I run $ export PYTHONIOENCODING=utf-8 in the shell (as suggested in some SO posts), then I get what looks like a binary file: 1 <D0><9A><D0><B0><D0><BC><D0><B0>.
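As a sanity check on that output, the bytes shown do form valid UTF-8. The sketch below decodes them directly; my assumption (not verified) is that whatever I used to view the file was displaying raw bytes rather than decoding them.

```python
# The byte values shown in the file after the "1" column.
raw = bytes.fromhex("D09AD0B0D0BCD0B0")

# Decoding as UTF-8 succeeds and yields four Cyrillic characters.
decoded = raw.decode("utf-8")

print(len(raw), len(decoded))
```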
I checked all the locale variables, and the relevant ones match what I have on my local machine.
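The same comparison can be made from inside Python, which may differ from what the shell reports. This is a rough sketch; locale_report is a hypothetical helper, and the list of environment variables is just the set I would look at first.

```python
import locale
import os


def locale_report():
    """Collect the locale-related settings visible to this Python process."""
    names = ("LANG", "LC_ALL", "LC_CTYPE", "PYTHONIOENCODING")
    report = {name: os.environ.get(name) for name in names}
    # Passing False asks for the encoding without calling setlocale().
    report["preferred_encoding"] = locale.getpreferredencoding(False)
    return report


if __name__ == "__main__":
    for key, value in locale_report().items():
        print(key, value, sep="\t")
```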
I can just process the file locally and copy it over, but I really want to understand what is going on here. Since the Python code runs fine on my own computer, I'm not sure it matters, but I'm adding it below:
import sys, xml.etree.ElementTree as ET

corpus = ET.parse('file.xml')
text = corpus.getroot()
for body in text :
    for sent in body :
        depDOMs = [(0,'') for i in range(len(sent)+1)]
        for word in sent :
            if word.tag == 'LF' :
                pass
            elif 'ID' in word.attrib and 'FEAT' in word.attrib and 'DOM' in word.attrib :
                ID = word.attrib['ID']
                try :
                    Form = word.text.replace(' ','_')
                except AttributeError :
                    Form = '_'
                try :
                    Lemma = word.attrib['LEMMA'].replace(' ', '_')
                except KeyError :
                    Lemma = '*NULL*'
                CPOS = word.attrib['FEAT'].split()[0]
                POS = word.attrib['FEAT'].replace( ' ' , '_' )
                Feats = '_'
                Head = word.attrib['DOM']
                if Head == '_root' :
                    Head = '0'
                try :
                    DepRel = word.attrib['LINK']
                except KeyError :
                    DepRel = 'ROOT'
                PHead = '_'
                PDepRel = '_'
                try:
                    if word.attrib['NODETYPE'] == 'FANTOM' :
                        word.attrib['LEMMA'] = '*'+word.attrib['LEMMA']+'*'
                except KeyError :
                    pass
                print( ID , Form , Lemma , Feats, CPOS , POS , Head , DepRel , PHead , PDepRel , sep='\t' )
            else :
                print( 'WARNING: what is this?',sent.attrib['ID'],word.attrib)
        print()
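One workaround I can think of is to stop relying on stdout and write to a file opened with an explicit encoding. This is only a sketch of that idea, not an explanation of the underlying problem; write_rows, the file name, and the sample row are all made up for illustration.

```python
import os
import tempfile


def write_rows(path, rows):
    """Write tab-separated rows to path, always encoded as UTF-8."""
    # newline="\n" keeps line endings identical across platforms.
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for row in rows:
            print(*row, sep="\t", file=out)
        print(file=out)  # blank line closing the sentence


if __name__ == "__main__":
    # Hypothetical sample row standing in for the values built in the loop.
    sample = [("1", "\u041a\u0430\u043c\u0430", "_")]
    target = os.path.join(tempfile.mkdtemp(), "example.conll")
    write_rows(target, sample)
    with open(target, "rb") as handle:
        print(handle.read())
```

Because the encoding is fixed at open(), the result should not depend on the locale or on whether stdout is a terminal.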