Reading UTF-8 XML and writing it to a file using Python

I am trying to parse a UTF-8 XML file and save some parts of it in another file. The problem is that this is my first Python script, and I am completely confused about the character encoding issues that I find.

My script crashes as soon as it tries to write a non-ASCII character to a file, although it can print that character on the command line (at least to some extent).

Here is the XML (only the parts that matter; it is a *.resx file that contains user interface strings):

<?xml version="1.0" encoding="utf-8"?>
<root>
     <resheader name="foo">
          <value>bar</value>
     </resheader>
     <data name="lorem" xml:space="preserve">
          <value>ipsum öä</value>
     </data>
</root>

And here is my Python script:

from xml.dom.minidom import parse

names = []
values = []

def getStrings(path):
    dom = parse(path)
    data = dom.getElementsByTagName("data")

    for i in range(len(data)):
        name = data[i].getAttribute("name")
        names.append(name)
        value = data[i].getElementsByTagName("value")
        values.append(value[0].firstChild.nodeValue.encode("utf-8"))

def writeToFile():
    with open("uiStrings-fi.py", "w") as f:
        for i in range(len(names)):
            line = names[i] + '="'+ values[i] + '"' #varName='varValue'
            f.write(line)
            f.write("\n")

getStrings("ResourceFile.fi-FI.resx")
writeToFile()

And here is the traceback:

Traceback (most recent call last):
  File "GenerateLanguageFiles.py", line 24, in <module>
    writeToFile()
  File "GenerateLanguageFiles.py", line 19, in writeToFile
    line = names[i] + '="'+ values[i] + '"' #varName='varValue'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

How should I fix my script so that it reads and writes UTF-8 correctly? The generated files are meant to be used with Robot Framework.


Remove the encode() call, that is, replace nodeValue.encode("utf-8") with plain nodeValue, and change the open() call to:

with open("uiStrings-fi.py", "w", "utf-8") as f:

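This relies on an open() that understands Unicode, which comes from the codecs module, so you also need to add

from codecs import open

at the top of the file.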

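The problem is that calling nodeValue.encode("utf-8") converts a Unicode string (Python's internal text representation, which can hold any Unicode character) into a regular byte string (whose elements are limited to the values 0-255). Later, when you build the line, names[i] is still Unicode but values[i] is a byte string. Python then tries to convert the byte string to Unicode and, having no other information, assumes it is ASCII. That fails because the string contains bytes above ASCII 127. Python does not know that values[i] holds UTF-8 data, and it raises an error rather than guessing. The simplest fix is therefore to keep everything as Unicode and let a Unicode-aware open (the one from codecs) do the encoding when the file is written.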
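The implicit conversion described above can be reproduced by hand. The following is only a small sketch and assumes Python 3 (the script in the question is Python 2), where the ASCII decode that Python 2 attempted silently has to be written out explicitly. The variable names are illustrative, not taken from the question.

```python
# The bytes that nodeValue.encode("utf-8") would produce for the sample value.
value_bytes = "ipsum öä".encode("utf-8")

# Python 2 implicitly attempted the equivalent of this decode when it mixed
# the byte string with a Unicode string. Done explicitly, it fails as well.
try:
    value_bytes.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the codec that was actually used recovers the text.
round_tripped = value_bytes.decode("utf-8")

print(ascii_error)
print(round_tripped == "ipsum öä")
```

The error names byte 0xc3, the first byte of the UTF-8 encoding of "ö", which is the same byte reported in the traceback in the question.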

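Alternatively, you could change names[i] to names[i].encode("utf-8"). Then names[i] would also be a byte string, and Python would not need to convert values[i] to Unicode. In general, though, it is cleaner to keep all text as Unicode inside the program and convert only when reading or writing. That is also how strings behave in Python 3, where text strings are always Unicode.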
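For comparison, here is a rough sketch of the same script under Python 3, where the built-in open() accepts an encoding argument and no codecs import is needed. It is not the asker's code: the function names, the inline RESX sample, and the temporary output directory are assumptions made so that the example is self-contained.

```python
import os
import tempfile
from xml.dom.minidom import parseString

# Sample input, trimmed from the .resx shown in the question.
RESX = """<?xml version="1.0" encoding="utf-8"?>
<root>
     <data name="lorem" xml:space="preserve">
          <value>ipsum öä</value>
     </data>
</root>
"""


def get_strings(xml_text):
    """Return (name, value) pairs for every <data> element, as text."""
    dom = parseString(xml_text.encode("utf-8"))
    pairs = []
    for data in dom.getElementsByTagName("data"):
        value = data.getElementsByTagName("value")[0]
        # No .encode() here: the value stays a text string.
        pairs.append((data.getAttribute("name"), value.firstChild.nodeValue))
    return pairs


def write_to_file(pairs, path):
    """Write name="value" lines; open() encodes the text as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for name, value in pairs:
            f.write('{}="{}"\n'.format(name, value))


# Hypothetical output location: a temporary directory, to avoid clutter.
out_path = os.path.join(tempfile.mkdtemp(), "uiStrings-fi.py")
write_to_file(get_strings(RESX), out_path)

# Read the file back as raw bytes to confirm what was written.
with open(out_path, "rb") as f:
    written = f.read()
```

Because names and values are both text throughout, the concatenation never mixes types, and the only encoding step happens inside open().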


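The XML file is UTF-8, and the parser decodes it when it builds the DOM. When you read from the DOM, you encode values to UTF-8 but not names. As a result, values holds byte strings while names holds Unicode strings.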

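Then, when you concatenate them, Python has to bring both operands to the same type. It tries to convert values[i] to unicode, but it does not know that the bytes are UTF-8 and assumes ASCII.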

The fix is to keep everything as Unicode and encode to UTF-8 only when writing:

values.append(value[0].firstChild.nodeValue)  # do not encode yet
...
f.write(line.encode('utf-8'))  # encode here, at write time
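The same idea can be sketched in a form that runs on its own. This assumes Python 3, where writing encoded bytes requires a file opened in binary mode ("wb"). An in-memory io.BytesIO stands in for that file here, and the sample names and values are placeholders taken from the XML in the question.

```python
import io

names = ["lorem"]
values = ["ipsum öä"]  # kept as text, not encoded yet

# A BytesIO stands in for a file opened in binary mode ("wb").
f = io.BytesIO()
for name, value in zip(names, values):
    line = name + '="' + value + '"\n'
    f.write(line.encode("utf-8"))  # encode only at the point of writing

written = f.getvalue()
```

The concatenation only ever sees text strings, so the single encode call at the end is the one place where the output encoding is chosen.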

Source: https://habr.com/ru/post/1749342/

