PyYAML - dump unicode with special characters (for example, accents)

Question

I work with YAML files that should be human-readable and editable by hand, but that will also be edited from Python code. I am using Python 2.7.3.

The file should handle accents (mainly for processing French text).

Here is an example of my problem:

    # coding: utf-8
    import codecs
    import yaml

    file = r'toto.txt'
    f = codecs.open(file, "w", encoding="utf-8")

    text = u'héhéhé, hûhûhû'
    textDict = {"data": text}

    # plain writes, for comparison
    f.write('write unicode : ' + text + '\n')
    f.write('write dict : ' + unicode(textDict) + '\n')
    # YAML dumps of the same data
    f.write('yaml dump unicode : ' + yaml.dump(text))
    f.write('yaml dump dict : ' + yaml.dump(textDict))
    f.write('yaml safe unicode : ' + yaml.safe_dump(text))
    f.write('yaml safe dict : ' + yaml.safe_dump(textDict))

    f.close()

The written file contains:

    write unicode : héhéhé, hûhûhû
    write dict : {'data': u'h\xe9h\xe9h\xe9, h\xfbh\xfbh\xfb\n'}
    yaml dump unicode : "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"
    yaml dump dict : {data: "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"}
    yaml safe unicode : "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"
    yaml safe dict : {data: "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"}

The YAML dump loads back fine with yaml, but it is not human-readable.

As you can see in the example code, the result is the same when I write the unicode representation of a dict (I don't know whether this is related).

I would like the dump to contain the accented text itself, not Unicode escape sequences. Is that possible?

python yaml unicode pyyaml non-ascii-characters

Hans baldzuhn Mar 30 '15 at 9:33

1 answer

Anthon · Accepted Answer · 2015-04-13T07:37:44+0000

yaml can dump Unicode characters as they are if you pass the allow_unicode=True keyword argument to any of the dumpers. If you do not provide a file, the dump() method returns a UTF-8 encoded string (i.e., the result of getvalue() on the StringIO() instance created to hold the dumped data), and you must decode it from UTF-8 before appending it to your unicode string:

    # coding: utf-8

    import codecs
    import ruamel.yaml as yaml

    file_name = r'toto.txt'
    text = u'héhéhé, hûhûhû'
    textDict = {"data": text}

    # dump straight to a file
    with open(file_name, 'w') as fp:
        yaml.dump(textDict, stream=fp, allow_unicode=True)

    print('yaml dump dict 1 : ' + open(file_name).read()),

    # dump to a string, then decode it before concatenating with unicode
    f = codecs.open(file_name, "w", encoding="utf-8")
    f.write('yaml dump dict 2 : ' + yaml.dump(textDict, allow_unicode=True).decode('utf-8'))
    f.close()
    print(open(file_name).read()),

Output:

    yaml dump dict 1 : {data: 'héhéhé, hûhûhû'}
    yaml dump dict 2 : {data: 'héhéhé, hûhûhû'}

I tested this with my extended version of PyYAML (ruamel.yaml), but this should work the same in PyYAML itself.
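For readers on Python 3, the same idea can be sketched with plain PyYAML. This is only an illustration and assumes that PyYAML is installed and that the code runs on Python 3, where dump() without a stream returns a str, so no decode step is needed. The names text_dict, escaped, readable and buffer are made up for the example, and the in-memory buffer stands in for a file opened with encoding="utf-8".

```python
import io

import yaml  # PyYAML; assumed to be installed

text_dict = {"data": "héhéhé, hûhûhû"}

# Default behaviour: non-ASCII characters are written as escape sequences.
escaped = yaml.safe_dump(text_dict)

# allow_unicode=True keeps the accented characters as they are.
readable = yaml.safe_dump(text_dict, allow_unicode=True)

# Dumping to a text stream works the same way; for a real file, something
# like open(path, "w", encoding="utf-8") would take the place of this buffer.
buffer = io.StringIO()
yaml.safe_dump(text_dict, buffer, allow_unicode=True)

print(escaped)
print(readable)
```

Both dumps load back to the same dictionary; only the readable one keeps the accents visible in the file.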
