Is it possible to create a Python [a-zA-Z] regular expression pattern to match and replace Unicode characters other than ASCII?

Question

Is it possible to create a Python [a-zA-Z] regular expression pattern to match and replace Unicode characters other than ASCII?

In the following regular expression, I would like each character in the string to be replaced with an “X”, but it does not work.

In Python 2.7:

>>> import re >>> re.sub(u"[a-zA-Z]","X","dfäg") 'XX\xc3\xa4X'

or

 >>> re.sub("[a-zA-Z]","X","dfäg",re.UNICODE) u'XX\xe4X'

In Python 3.4:

 >>> re.sub("[a-zA-Z]","X","dfäg") 'XXäX'

Is there any way to "customize" the [a-zA-Z] template to match "ä", "ü", etc.? If this is not possible, how can I create a similar character range pattern between square brackets that will include Unicode characters in the regular full alphabet range? I mean, for example, in a language like German, “ä” will be placed somewhere next to “a” in the alphabet, so one would expect it to be included in the range “az”.

+5

python regex unicode

X mann Oct 14 '15 at 14:19

source share

1 answer

Wiktor stribiżew · Accepted Answer · 2015-10-14T14:37:49+0000

you can use

 (?![\d_])\w

With Unicode modifier. (?![\d_]) look-ahead restricts the abbreviation class \w , since it cannot match any digits ( \d ) or underscores.

Watch the regex demo

A Python 3 demo :

 import re print (re.sub(r"(?![\d_])\w","X","dfäg")) # => XXXX

Regarding Python 2 :

 # -*- coding: utf-8 -*- import re s = "dfäg" w = re.sub(ur'(?![\d_])\w', u'X', s.decode('utf8'), 0, re.UNICODE).encode("utf8") print(w)

Is it possible to create a Python [a-zA-Z] regular expression pattern to match and replace Unicode characters other than ASCII?

More articles: