Strange re.sub behavior with utf-8 strings

Question

Can someone explain this strange behavior to me? I would expect both replacement methods either to work or to fail together. Is it just me, or does anyone else find this inconsistent?

>>> u'è'.replace("\xe0","")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
>>> re.sub(u'è','\xe0','',flags=re.UNICODE)
''

(Please note that I am not asking why u'è'.replace("\xe0", "") raises an error!)

python regex encoding

luke14free May 07 '12 at 15:54

2 answers

subiet · Answer 1 · 2012-05-07T16:40:22+0000

From the Unicode documentation:

The arguments to these methods can be Unicode strings or 8-bit strings. 8-bit strings will be converted to Unicode before carrying out the operation; Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception.
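The conversion described in that passage can be imitated in a few lines. This is only a sketch, written for Python 3 (where `bytes` plays the role of the Python 2 8-bit string and nothing is converted implicitly); the helper name `implicit_ascii_decode` is made up for illustration and is not part of any library.

```python
# Python 3 sketch: imitate the implicit ASCII decode that Python 2 applied
# to an 8-bit string passed to a unicode method such as replace().

def implicit_ascii_decode(raw):
    """Hypothetical helper: decode bytes with the ASCII codec, as Python 2 did implicitly."""
    return raw.decode("ascii")

text = "\u00e8"   # the text string u'è' from the question
raw = b"\xe0"     # the 8-bit string "\xe0" from the question

try:
    # Byte 0xe0 is greater than 127, so the ASCII decode fails
    # before replace() ever runs.
    text.replace(implicit_ascii_decode(raw), "")
    outcome = "no error"
except UnicodeDecodeError as exc:
    outcome = "UnicodeDecodeError: " + exc.reason

print(outcome)
```

An 8-bit string that contains only ASCII bytes, such as b"e", would pass through the same helper without any error, which is why the implicit conversion often goes unnoticed.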

From the re documentation:

This module provides regular expression matching operations similar to those found in Perl. Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings.

Since you are not explicitly specifying the Unicode flag for the re module, it does not attempt the conversion and therefore does not raise the error. That is why the two do not behave consistently.

vincent-lg · Answer 2 · 2017-05-26T00:27:18+0000

Python 2.x handles encodings somewhat unnaturally, because it allows implicit conversion. It will try to mix unicode and non-unicode strings when the user does not do the conversion. In the end, this does not solve the problem: encoding has to be handled by the developer from the very beginning. Python 2 just makes things less explicit and slightly less obvious.

>>> u'è'.replace(u"\xe0", u"")
u'\xe8'

This is your original example, except that I explicitly told Python that all the strings are unicode. If you don't, Python will try to convert them. And since the default encoding in Python 2 is ASCII, this obviously fails with your example.

Encoding is a tricky subject, but with some good habits (like converting early and always being sure which type of data the program is handling at a given point), it usually (and I insist, USUALLY) goes well.
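As a rough illustration of the "convert early" habit, here is a minimal sketch. It assumes Python 3 and assumes the incoming bytes are UTF-8; the helper name `to_text` is hypothetical, chosen only for this example.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return text, decoding bytes with a known encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Decode once, at the point where the data enters the program.
incoming = b"\xc3\xa8"   # UTF-8 bytes for 'è' (assumed input encoding)
needle = b"\xc3\xa0"     # UTF-8 bytes for 'à'
text = to_text(incoming)
pattern = to_text(needle)

# From here on, work only with text, so no implicit conversion is needed.
result = text.replace(pattern, "")

# Encode again only when the data leaves the program.
encoded = result.encode("utf-8")
```

Because both arguments are already text by the time replace() is called, the operation never depends on a default encoding.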

Hope this helps!
