What is the correct way to encode escape characters in Python 2 without destroying Unicode?

I think I'm going crazy with Python Unicode strings. I am trying to encode escape characters in a Unicode string without escaping the actual Unicode characters. I get this:

    In [14]: a = u"Example\n"
    In [15]: b = u"Пример\n"
    In [16]: print a
    Example

    In [17]: print b
    Пример

    In [18]: print a.encode('unicode_escape')
    Example\n
    In [19]: print b.encode('unicode_escape')
    \u041f\u0440\u0438\u043c\u0435\u0440\n

while what I desperately need is this (the English example already works the way I want, obviously):

    In [18]: print a.encode('unicode_escape')
    Example\n
    In [19]: print b.encode('unicode_escape')
    Пример\n

What should I do if I don’t upgrade to Python 3?

P.S. As pointed out in the answers, what I actually want is to escape control characters. I don't need anything beyond those to be escaped.

4 answers

First, let's correct the terminology: what you are trying to do is replace each control character with an equivalent escape sequence.

I could not find a built-in method for this, and nobody has posted one yet. Fortunately, it is not a complicated function to write.

    control_chars = [unichr(c) for c in range(0x20)]  # you may extend this as required

    def control_escape(s):
        chars = []
        for c in s:
            if c in control_chars:
                chars.append(c.encode('unicode_escape'))
            else:
                chars.append(c)
        return u''.join(chars)

Or a slightly less readable one-liner version:

    def control_escape2(s):
        return u''.join([c.encode('unicode_escape') if c in control_chars else c for c in s])
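If you also need this on Python 3, where unichr is gone and str.encode('unicode_escape') returns bytes, a minimal sketch of the same per-character idea (my own port, not from the original answer) might look like:

```python
# Python 3 sketch: escape only control characters, leave other Unicode text alone.
control_chars = {chr(c) for c in range(0x20)}  # extend as required

def control_escape(s):
    # encode() returns bytes on Python 3, so decode back to str
    return ''.join(
        c.encode('unicode_escape').decode('ascii') if c in control_chars else c
        for c in s
    )

print(control_escape('Пример\n'))  # prints: Пример\n
```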

Backslash-escaping the ASCII control characters in the middle of Unicode data is definitely a useful task. But it is not enough just to escape them; you also need to be able to properly unescape them when you want the actual data back.

There should be a way to do this in the Python stdlib, but there isn't. I filed a bug report: http://bugs.python.org/issue18679

In the meantime, here is a workaround using translate() and a bit of hackery:

    tm = dict((k, repr(chr(k))[1:-1]) for k in range(32))
    tm[0] = r'\0'
    tm[7] = r'\a'
    tm[8] = r'\b'
    tm[11] = r'\v'
    tm[12] = r'\f'
    tm[ord('\\')] = '\\\\'

    b = u"Пример\n"
    c = b.translate(tm)
    print(c)  # results in: Пример\n

All control characters without a dedicated backslash escape will be escaped with a \x## sequence, but if you need something else you can adjust the translation table accordingly. This approach is not lossy, so it works for me.

Getting the data back, however, is hackier still, because you cannot use translate() to map multi-character sequences back to single characters.

    d = c.encode('latin1', 'backslashreplace').decode('unicode_escape')
    print(d)  # results in: Пример with a trailing newline character

The trick is that you encode the characters that map to single bytes using latin1, while backslash-escaping the Unicode characters that latin1 does not know about, so that the unicode_escape codec can then handle both correctly.
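The same trick still works on Python 3; here is a self-contained round-trip sketch (my own condensed version of the table above, without the named special cases like \a and \v):

```python
# Escape table: each control char -> its repr() escape; backslash doubled
# so that escaping is reversible.
escape_tm = {k: repr(chr(k))[1:-1] for k in range(32)}
escape_tm[ord('\\')] = '\\\\'

def escape_control(s):
    return s.translate(escape_tm)

def unescape_control(s):
    # latin1 maps code points 0-255 to bytes one-to-one; backslashreplace
    # escapes everything above 255, so unicode_escape can then undo both
    # the real escapes and the backslash-replaced characters.
    return s.encode('latin1', 'backslashreplace').decode('unicode_escape')

original = 'Пример\r\n'
escaped = escape_control(original)
print(escaped)  # prints: Пример\r\n (with literal backslashes)
assert unescape_control(escaped) == original
```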

UPDATE

So, I had a case where I needed this to work on both Python 2.7 and Python 3.3. Here is what I did (buried in a _compat.py module):

    if isinstance(b"", str):
        # Python 2
        byte_types = (str, bytes, bytearray)
        text_types = (unicode, )
        def uton(x): return x.encode('utf-8', 'surrogateescape')
        def ntob(x): return x
        def ntou(x): return x.decode('utf-8', 'surrogateescape')
        def bton(x): return x
    else:
        # Python 3
        byte_types = (bytes, bytearray)
        text_types = (str, )
        def uton(x): return x
        def ntob(x): return x.encode('utf-8', 'surrogateescape')
        def ntou(x): return x
        def bton(x): return x.decode('utf-8', 'surrogateescape')

    escape_tm = dict((k, ntou(repr(chr(k))[1:-1])) for k in range(32))
    escape_tm[0] = u'\\0'
    escape_tm[7] = u'\\a'
    escape_tm[8] = u'\\b'
    escape_tm[11] = u'\\v'
    escape_tm[12] = u'\\f'
    escape_tm[ord('\\')] = u'\\\\'

    def escape_control(s):
        if isinstance(s, text_types):
            return s.translate(escape_tm)
        else:
            return s.decode('utf-8', 'surrogateescape').translate(escape_tm).encode('utf-8', 'surrogateescape')

    def unescape_control(s):
        if isinstance(s, text_types):
            return s.encode('latin1', 'backslashreplace').decode('unicode_escape')
        else:
            return s.decode('utf-8', 'surrogateescape').encode('latin1', 'backslashreplace').decode('unicode_escape').encode('utf-8', 'surrogateescape')

The .encode method returns a byte string (the str type in Python 2), so it cannot return Unicode characters.

But since there are only a few backslash escape sequences, you can easily handle them manually with .replace(). See http://docs.python.org/reference/lexical_analysis.html#string-literals for the complete list.
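For instance, a minimal sketch of the manual approach (covering only the most common sequences; extend the list as needed) could be:

```python
# Manual replacement: the backslash itself must be replaced first,
# otherwise escape sequences inserted by later replacements would be
# escaped again.
def escape_common(s):
    for ch, seq in [('\\', '\\\\'), ('\n', '\\n'),
                    ('\r', '\\r'), ('\t', '\\t')]:
        s = s.replace(ch, seq)
    return s

print(escape_common('Пример\n'))  # prints: Пример\n (literal backslash-n)
```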


.encode('unicode_escape') returns a string of bytes. You probably want to escape the control characters directly in the Unicode string:

    # coding: utf8
    import re

    def esc(m):
        return u'\\x{:02x}'.format(ord(m.group(0)))

    s = u'\r\t\b马克\n'

    # Match control characters 0-31.
    # Use the DOTALL option to match end-of-line control characters as well.
    print re.sub(ur'(?s)[\x00-\x1f]', esc, s)

Output:

 \x0d\x09\x08马克\x0a 
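Ported to Python 3 (print as a function, no u''/ur'' prefixes needed), a sketch of the same snippet would be:

```python
import re

def esc(m):
    # Replace a matched control character with its \x## escape.
    return '\\x{:02x}'.format(ord(m.group(0)))

s = '\r\t\b马克\n'
# The character class [\x00-\x1f] matches \n on its own; the (?s) flag
# is kept only for parity with the original Python 2 example.
print(re.sub(r'(?s)[\x00-\x1f]', esc, s))  # prints: \x0d\x09\x08马克\x0a
```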

Note that there are other Unicode control characters beyond 0-31, so you may need more than this:

    # coding: utf8
    import re
    import unicodedata as ud

    def esc(m):
        c = m.group(0)
        if ud.category(c).startswith('C'):
            return u'\\u{:04x}'.format(ord(c))
        return c

    s = u'\rMark\t\b马克\n'

    # Match ALL characters so the replacement function can test the
    # category. Not very efficient if the string is long.
    print re.sub(ur'(?s).', esc, s)

Output:

 \u000dMark\u0009\u0008马克\u000a 

Perhaps you need finer control over what counts as a control character. There are a number of Unicode categories. You can build a regular expression matching a specific category with:

    import sys
    import re
    import unicodedata as ud

    # Generate a regular expression that matches any Cc category Unicode character.
    Cc_CODES = u'(?s)[' + re.escape(u''.join(unichr(n) for n in range(sys.maxunicode + 1)
                                             if ud.category(unichr(n)) == 'Cc')) + u']'
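To show the generated pattern in use, here is a Python 3 sketch (chr replaces unichr; the esc helper mirrors the earlier example, using \u#### escapes):

```python
import sys
import re
import unicodedata as ud

# Character class matching every code point whose Unicode category is Cc.
Cc_CODES = '(?s)[' + re.escape(''.join(
    chr(n) for n in range(sys.maxunicode + 1)
    if ud.category(chr(n)) == 'Cc')) + ']'

def esc(m):
    return '\\u{:04x}'.format(ord(m.group(0)))

print(re.sub(Cc_CODES, esc, '\rMark\t马克\n'))  # prints: \u000dMark\u0009马克\u000a
```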

Source: https://habr.com/ru/post/1402328/
