How to handle ASCII string as unicode and unescape escaped characters in it in python?

Question

If I have a Unicode string, I can encode it as an ASCII string like this:

 >>> u'\u003cfoo/\u003e'.encode('ascii')
 '<foo/>'

However, I have, for example, this ASCII string:

 '\u003cfoo/\u003e'

... that I want to turn into the same ASCII string as in the first example above:

 '<foo/>'

+24

python unicode ascii

John Nov 06 '08 at 1:55

5 answers

Ned Batchelder said:

This is a bit dangerous depending on where the string is coming from, but what about:

 >>> s = '\u003cfoo\u003e'
 >>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
 '<foo>'

In fact, this method can be made safe as follows:

 >>> s = '\u003cfoo\u003e'
 >>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]

Note the triple-quoted string and the dash just before the closing triple quote.

Using a triple-quoted string ensures that if the user enters "\\" (spaces added for visual clarity) in the string, it does not break the eval call.
The dash at the end guards against a user string that ends with a "\" character. Before assigning the result, we strip the inserted dash with [:-1].

This way, there is no need to worry about what the user enters, as long as it is captured in raw form.
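
For readers on Python 3, the same triple-quote-and-dash construction can be tried with ast.literal_eval in place of eval. This is a substitution, not the method from the answer itself: literal_eval only parses Python literals and does not run arbitrary expressions. The helper name unescape_via_literal is hypothetical, and the sketch assumes Python 3, where plain string literals are already Unicode.

```python
import ast

def unescape_via_literal(s):
    # Hypothetical helper: unescape \uXXXX sequences by parsing a string literal.
    # Escape double quotes so the input cannot close the literal early.
    escaped = s.replace('"', r'\"')
    # Wrap in triple quotes and append a dash, as in the answer;
    # the dash keeps a trailing backslash from escaping the closing quotes.
    literal = '"""' + escaped + '-"""'
    # literal_eval accepts literals only; [:-1] strips the inserted dash.
    return ast.literal_eval(literal)[:-1]

s = r'\u003cfoo\u003e'
print(unescape_via_literal(s))
print(unescape_via_literal(r'say "hi" \u003cb\u003e'))
```

Malformed escapes (for example a truncated \u sequence) still raise an error, so untrusted input probably wants a try/except around the call.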

+2

MakerDrone Jul 01 2018-12-12T00:

In Python 2.5, the correct encoding is "unicode_escape", not "unicode-escape" (note the underscore).

I'm not sure whether a newer version of Python changed the codec name, but only the underscore form worked here.

In any case, this is it.
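
One way to check which spellings a given interpreter accepts is to ask the codec registry directly. The sketch below assumes Python 3, where both spellings appear to resolve to the same codec; it says nothing about how Python 2.5 behaved.

```python
import codecs

# Look up the codec under both spellings.
with_underscore = codecs.lookup('unicode_escape')
with_hyphen = codecs.lookup('unicode-escape')

# Compare the canonical names the registry reports for each lookup.
print(with_underscore.name, with_hyphen.name)

# Decode the same escaped bytes with each spelling.
data = b'\\u003cfoo/\\u003e'
decoded_underscore = data.decode('unicode_escape')
decoded_hyphen = data.decode('unicode-escape')
print(decoded_underscore, decoded_hyphen)
```

If a spelling is not recognized, codecs.lookup raises LookupError, which makes the difference easy to spot on an older interpreter.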

+1

Kaniabi Nov 17 '09 at 18:14

At some point you will run into problems with special characters, such as Chinese characters or emoticons, in the string you want to decode. The errors look like this:

 UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)

In my case (processing Twitter data), I decoded as follows, which let me see all characters without errors:

 >>> s = '\u003cfoo\u003e'
 >>> s.decode('unicode-escape').encode('utf-8')
 '<foo>'
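
To make the failure concrete, here is a minimal sketch of the same idea, assuming Python 3 (where str has no decode method, so the escaped text is turned into bytes first). The sample string is made up for illustration.

```python
# Escaped text containing two Chinese characters (illustrative sample).
s = r'\u4e2d\u6587'

# Turn the ASCII text into bytes, then interpret the \uXXXX escapes.
text = s.encode('ascii').decode('unicode_escape')

# Encoding the result as ASCII fails: the characters are outside range(128).
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = text.encode('utf-8')
print(ascii_failed, utf8_bytes)
```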

0

OkezieE Mar 29 '14 at 3:06

This is a little dangerous depending on where the string is coming from, but what about:

 >>> s = '\u003cfoo\u003e'
 >>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
 '<foo>'

-1

Ned Batchelder Nov 06 '08 at 2:01

hark · Accepted Answer · 2008-11-06 02:26

It took me a while to figure this out, but this page has the best answer:

 >>> s = '\u003cfoo/\u003e'
 >>> s.decode('unicode-escape')
 u'<foo/>'
 >>> s.decode('unicode-escape').encode('ascii')
 '<foo/>'

There is also a raw-unicode-escape codec that handles a different way of specifying Unicode strings. For more information, see the “Unicode Constructors” section of the linked page (since I'm not Unicode-savvy).

EDIT: see also Standard Python Encodings.
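
The session in this answer is Python 2, where str has a decode method. On Python 3 that method is gone from str, so a rough equivalent, assuming the input is pure ASCII text, goes through bytes first:

```python
# The raw-string prefix keeps the backslashes literal, like the Python 2 str.
s = r'\u003cfoo/\u003e'

# str -> ASCII bytes, then interpret the escape sequences.
unescaped = s.encode('ascii').decode('unicode_escape')
print(unescaped)

# Going back to ASCII bytes works here because '<foo/>' is all ASCII.
ascii_bytes = unescaped.encode('ascii')
print(ascii_bytes)
```

If the input may already contain non-ASCII characters, the first encode('ascii') step will raise, so this sketch only covers the case described in the question.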
