How do I treat an ASCII string as Unicode and unescape the escaped characters in it in Python?

For example, if I have a Unicode string, I can encode it as an ASCII string:

 >>> u'\u003cfoo/\u003e'.encode('ascii')
 '<foo/>'

However, I have, for example, this ASCII string:

 '\u003cfoo/\u003e' 

... that I want to turn into the same ASCII string as in the first example above:

 '<foo/>' 
+24
python unicode ascii
Nov 06 '08 at 1:55
5 answers

It took me a while to figure this out, but this page had the best answer:

 >>> s = '\u003cfoo/\u003e'
 >>> s.decode('unicode-escape')
 u'<foo/>'
 >>> s.decode('unicode-escape').encode('ascii')
 '<foo/>'
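
The session above is Python 2, where a plain `str` has a `decode` method. As a minimal sketch of the same idea on Python 3 (an assumption; the answer itself does not cover Python 3), the text first has to be converted to bytes before the `unicode_escape` codec can be applied. The helper name `unescape` is hypothetical.

```python
def unescape(text):
    """Decode backslash escapes such as \\u003c in an ASCII-only str (Python 3)."""
    # str has no .decode() in Python 3, so go through bytes first.
    # Encoding as ASCII raises UnicodeEncodeError if text is not pure ASCII.
    return text.encode('ascii').decode('unicode_escape')


s = r'\u003cfoo/\u003e'   # raw literal: the backslashes are real characters
print(unescape(s))        # <foo/>
```

The raw string literal matters here: without the `r` prefix, Python 3 would already interpret the `\u` escapes when parsing the source.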

There is also a raw-unicode-escape codec that handles a different way of specifying Unicode strings. For more information, see the "Unicode Constructors" section of the linked page (since I'm not Unicode-savvy).
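
To illustrate how the two codecs differ, here is a small sketch, assuming Python 3 and a byte string as input: `unicode_escape` interprets every Python string escape, while `raw_unicode_escape` interprets only `\uXXXX` and `\UXXXXXXXX` and leaves other backslash sequences alone.

```python
data = rb'\u003cfoo/\u003e\n'   # bytes: \u escapes plus a literal backslash-n

full = data.decode('unicode_escape')      # all escapes interpreted
raw = data.decode('raw_unicode_escape')   # only \u and \U interpreted

print(repr(full))   # '<foo/>\n'   (ends with a real newline)
print(repr(raw))    # '<foo/>\\n'  (backslash and 'n' are kept)
```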

EDIT: see also Standard Python Encodings.

+40
Nov 06 '08 at 2:26

Ned Batchelder said:

This is a bit dangerous depending on where the string is coming from, but what about:

 >>> s = '\u003cfoo\u003e'
 >>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
 '<foo>'

In fact, this method can be made safe as follows:

 >>> s = '\u003cfoo\u003e'
 >>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]

Note the triple-quoted string and the dash just before the closing triple quote.

  • Using a triple-quoted string ensures that if the user enters "\\" (spaces added for visual clarity) in the string, it does not break the evaluation;
  • The dash at the end guards against a user string that ends with a "\" character. Before assigning the result, we strip the inserted dash with [:-1].

Thus, there is no need to worry about what the user enters, as long as it is captured in raw form.
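
As a hedged variation on the same trick, the sketch below swaps `eval` for `ast.literal_eval`, which only accepts Python literals and therefore cannot call functions. This is an assumption layered on the answer, written for Python 3 (so no `u` prefix is needed); the helper name `unescape_literal` is hypothetical.

```python
import ast


def unescape_literal(s):
    """Unescape s by parsing it as a triple-quoted string literal."""
    # Escape double quotes so the input cannot close the literal early,
    # and append a dash so a trailing backslash cannot escape the closing quotes.
    source = '"""' + s.replace('"', r'\"') + '-"""'
    # literal_eval accepts only literals, so no function calls can run.
    return ast.literal_eval(source)[:-1]   # drop the inserted dash


print(unescape_literal(r'\u003cfoo\u003e'))   # <foo>
print(unescape_literal('say "hi"'))           # say "hi"
```

Input that contains a malformed escape (for example an incomplete `\x` sequence) still fails to parse, so callers would need to handle that case themselves.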

+2
Jul 01 2018-12-12T00:

In Python 2.5, the correct encoding is "unicode_escape", not "unicode-escape" (note the underscore).

I'm not sure whether newer versions of Python changed the codec name, but here it worked only with the underscore.

In any case, that is the fix.
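
For comparison, here is a quick check that can be run on a current Python 3 interpreter. It says nothing about Python 2.5 behavior; it only shows how to test whether both spellings resolve to the same codec on whatever version is installed.

```python
import codecs

# Look the codec up under both spellings and compare what is returned.
hyphen = codecs.lookup('unicode-escape')
underscore = codecs.lookup('unicode_escape')

print(hyphen.name == underscore.name)   # True
```

If one of the spellings were not recognized, `codecs.lookup` would raise `LookupError` instead.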

+1
Nov 17 '09 at 18:14

At some point you will run into problems when the string you want to decode contains special characters, such as Chinese characters or emoticons, that is, errors that look like this:

 UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128) 

In my case (Twitter data processing), I decoded as follows, which let me see all characters without errors:

 >>> s = '\u003cfoo\u003e'
 >>> s.decode('unicode-escape').encode('utf-8')
 '<foo>'
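
The `<foo>` example never leaves the ASCII range, so it does not trigger the error. The sketch below, assuming Python 3 and an illustrative input of two escaped Chinese characters (not taken from the original data), shows the ASCII failure and the UTF-8 alternative side by side.

```python
s = r'\u4e2d\u6587'   # escaped form of two Chinese characters

text = s.encode('ascii').decode('unicode_escape')
print(len(text))   # 2

try:
    text.encode('ascii')   # these characters are not representable in ASCII
except UnicodeEncodeError as exc:
    print('ascii failed:', exc.reason)   # ascii failed: ordinal not in range(128)

utf8_bytes = text.encode('utf-8')   # UTF-8 covers the full Unicode range
print(utf8_bytes)   # b'\xe4\xb8\xad\xe6\x96\x87'
```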
0
Mar 29 '14 at 3:06

This is a little dangerous depending on where the string is coming from, but what about:

 >>> s = '\u003cfoo\u003e'
 >>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
 '<foo>'
-1
Nov 06 '08 at 2:01


