How to determine if a string contains Unicode escape sequences

How do you determine whether a string contains escaped unicode, so you know whether to run .decode("unicode-escape")?

For instance:

test.py

    # -*- coding: utf-8 -*-
    str_escaped = '"A\u0026B"'
    str_unicode = '"́  "'
    arr_all_strings = [str_escaped, str_unicode]

    def is_escaped_unicode(str):
        # how do I determine if this is escaped unicode?
        pass

    for str in arr_all_strings:
        if is_escaped_unicode(str):
            str = str.decode("unicode-escape")
        print str

Current output:

    "A\u0026B"
    "́  "

Expected result:

    "A&B"
    "́  "

How do I define is_escaped_unicode(str) so that it determines whether the string passed in really is escaped unicode?

+5
3 answers
    # -*- coding: utf-8 -*-
    # No u prefix here: a u'' literal would already turn \u0026 into &
    str_escaped = '"A\u0026B"'
    str_unicode = '"́  "'
    arr_all_strings = [str_escaped, str_unicode]

    def is_ascii(s):
        # True when every character falls in the 7-bit ASCII range
        return all(ord(c) < 128 for c in s)

    def is_escaped_unicode(str):
        # escaped unicode consists of ASCII characters only
        return is_ascii(str)

    for str in arr_all_strings:
        if is_escaped_unicode(str):
            str = str.decode("unicode-escape")
        print str

The code above will work for your case.

Explanation:

  • Every character in str_escaped is in the ASCII range.

  • str_unicode contains characters outside the ASCII range.
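One caveat with the ASCII check is that it cannot tell escaped text apart from ordinary ASCII text that happens to contain backslashes. Here is a minimal sketch of that limitation, written for Python 3 (an assumption; the code in this answer is Python 2). The `unescape` helper and the Windows-style path are hypothetical and only serve as an illustration:

```python
def is_ascii(s):
    """Return True if every character is in the 7-bit ASCII range."""
    return all(ord(c) < 128 for c in s)


def unescape(s):
    """Python 3 stand-in for the Python 2 call s.decode("unicode-escape")."""
    return s.encode("ascii").decode("unicode-escape")


escaped = '"A\\u0026B"'   # backslash, u, 0026: a real escape sequence
path = "C:\\temp\\new"    # hypothetical sample: plain ASCII with backslashes

for sample in (escaped, path):
    if is_ascii(sample):
        # Both samples pass the check, so both get unescaped.
        print(repr(sample), "->", repr(unescape(sample)))
```

Both samples pass the ASCII test, so the path is unescaped as well and comes back with a tab and a newline where `\t` and `\n` used to be. The check is only safe when the input is known never to contain backslashes it did not mean as escapes.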

+2

You can't.

There is no way to tell whether "A\u0026B" came from text that was escape-encoded, whether the data is simply the literal bytes "A\u0026B", or whether we arrived there from some other encoding.

How ... do you know whether to run .decode("unicode-escape")

You have to know whether someone previously called text.encode('unicode-escape'). The bytes themselves cannot tell you.
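A small illustration of why the bytes alone are ambiguous, sketched in Python 3 (an assumption; the question uses Python 2). The sample strings are made up for the example. Two different pieces of text, produced by two different encodings, end up as identical bytes:

```python
# Text containing a non-ASCII character, escaped by some upstream code.
accented = "caf\u00e9"            # four characters: c, a, f, e-acute
from_escape = accented.encode("unicode-escape")

# Text that literally contains a backslash followed by "xe9".
literal = "caf\\xe9"              # seven characters
from_ascii = literal.encode("ascii")

print(from_escape)
print(from_ascii)
print(from_escape == from_ascii)  # same bytes, different original text
```

Given only those bytes, a receiver has no way to know which of the two texts was intended.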

Of course, you can guess by looking for \u or \U escape sequences, or just wrap the decode in try/except and see what happens, but I don't recommend going down this route.
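For completeness, the guessing approach could look roughly like the following Python 3 sketch. The names `ESCAPE_RE`, `looks_escaped` and `unescape_guess` are hypothetical, and the pattern only covers the \u and \U forms mentioned above. It remains a guess: text that legitimately contains such a sequence will still be altered.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
ESCAPE_RE = re.compile(r"\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}")


def looks_escaped(text):
    """Guess: True if the text contains at least one \\u or \\U escape."""
    return ESCAPE_RE.search(text) is not None


def _replace(match):
    """Turn one matched escape into its character, if it is a valid one."""
    try:
        return chr(int(match.group(0)[2:], 16))
    except (ValueError, OverflowError):
        # Code point above U+10FFFF: leave the sequence as it was.
        return match.group(0)


def unescape_guess(text):
    """Replace only the sequences the pattern recognises; keep the rest."""
    return ESCAPE_RE.sub(_replace, text)


for sample in ('"A\\u0026B"', "plain text", "C:\\temp"):
    print(repr(sample), looks_escaped(sample), repr(unescape_guess(sample)))
```

Restricting the replacement to the matched sequences avoids touching other backslashes such as `\t`, which a blanket unicode-escape decode would also interpret.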

If you encounter a bytestring in your application and you don't already know its encoding, then your problem lies elsewhere and needs to be fixed there.

+6

Here is a rough way to do it. Try decoding as unicode-escape, and if that succeeds, the resulting string will be shorter than the original string.

    str_escaped = '"A\u0026B"'
    str_unicode = '"́  "'
    arr_all_strings = [str_escaped, str_unicode]

    def decoder(s):
        y = s.decode('unicode-escape')
        return y if len(y) < len(s) else s.decode('utf8')

    for s in arr_all_strings:
        print s, decoder(s)

Output

    "A\u0026B" "A&B"
    "  " "  "

But seriously, you can save yourself a lot of energy if you can upgrade to Python 3. And if you can't upgrade to Python 3 right away, you may find this article useful: Pragmatic Unicode, written by SO veteran Ned Batchelder.
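For readers already on Python 3, the length heuristic from this answer might be ported roughly as follows. This is a sketch under assumptions, not the answer's own code: Python 3 strings have no .decode, so the text is first encoded with latin-1 and the backslashreplace error handler, and the try/except for malformed escapes is an addition.

```python
def decoder(s):
    """Unescape s if doing so shortens it; otherwise return s unchanged."""
    try:
        # backslashreplace keeps characters outside Latin-1 intact by
        # turning them into escapes that the decode step restores.
        y = s.encode("latin-1", "backslashreplace").decode("unicode-escape")
    except UnicodeDecodeError:
        # Malformed escape, such as a truncated \uXXXX sequence.
        return s
    return y if len(y) < len(s) else s


for sample in ('"A\\u0026B"', '"\u00e9"', "bad \\u12"):
    print(repr(sample), "->", repr(decoder(sample)))
```

As with the original, any backslash sequence that shortens the string is treated as an escape, so this stays a rough heuristic.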

+1

Source: https://habr.com/ru/post/1270817/

