Word boundary for Unicode text in Python regular expressions

Question

I want to use a word boundary in a regular expression to match some Unicode text. Python's regular expressions treat Unicode letters as word boundaries (that is, as non-word characters), as shown here:

    >>> re.search(r"\by\b","üyü")
    <_sre.SRE_Match object at 0x02819E58>
    >>> re.search(r"\by\b","ğyğ")
    <_sre.SRE_Match object at 0x028250C8>
    >>> re.search(r"\by\b","uyu")
    >>>

What should I do so that \b treats Unicode letters as word characters and does not match next to them?

+6

python regex unicode

Mert nuhoglu Oct 15 '13 at 7:38

3 answers

Use re.UNICODE:

    >>> re.search(r"\by\b","üyü", re.UNICODE)
    >>>
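
For readers on Python 3, here is a minimal sketch of the same check. It assumes str patterns, which are Unicode-aware by default there, so passing re.UNICODE is redundant but harmless, while re.ASCII brings back the behaviour from the question. The helper name has_standalone_y is hypothetical and only serves the illustration.

```python
import re

def has_standalone_y(text, flags=0):
    """Return True if 'y' occurs with a word boundary on both sides."""
    return re.search(r"\by\b", text, flags) is not None

# Unicode-aware matching: ü and ğ are word characters, so there is
# no boundary between them and the y.
print(has_standalone_y("üyü", re.UNICODE))    # False
print(has_standalone_y("ğyğ", re.UNICODE))    # False

# ASCII-only matching treats ü as a non-word character, which
# reproduces the unwanted match from the question.
print(has_standalone_y("üyü", re.ASCII))      # True

# A y that really stands alone still matches in Unicode mode.
print(has_standalone_y("ü y ü", re.UNICODE))  # True
```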

+7

Michael brennan Oct 15 '13 at 7:45

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    s = ur"abcd "
    import re
    rx1 = re.compile(ur"(?u)")
    rx2 = re.compile(ur"(?u)\b")
    rx3 = re.compile(ur"(?u)\b\b")
    print rx1.findall(s)
    print rx2.findall(s)
    print rx3.findall(s)
    print re.search(ur'(?u)\b', ur'')
    print re.search(ur'(?u)\b\b', ur'')

Output:

    [u'\u0410\u0411\u0412']
    [u'\u0410\u0411\u0412']
    []
    <_sre.SRE_Match object at 0x01F056B0>
    None
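
A Python 3 sketch of the same kind of experiment may help in reading that output. The sample string is an assumption chosen for illustration (Cyrillic letters placed directly after a Latin letter) and is not the input used in the answer. Only the three Cyrillic letters are taken from the output shown above.

```python
import re

word = "\u0410\u0411\u0412"   # the Cyrillic letters that appear in the output above
sample = "abcd x" + word      # hypothetical input, assumed for this sketch

# No boundary required: the letters are found anywhere.
plain = re.findall(word, sample)

# Boundary required after the letters: the end of the string qualifies.
trailing = re.findall(word + r"\b", sample)

# Boundary required on both sides: the preceding "x" is also a word
# character, so there is no boundary in front and nothing matches.
both = re.findall(r"\b" + word + r"\b", sample)

print(plain)
print(trailing)
print(both)
```

The empty third result is the point of the experiment: once Cyrillic letters count as word characters, \b no longer matches between a Latin letter and a Cyrillic one.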

0

Alexander Lubyagin Dec 6 '17 at 8:27

rolandvarga · Accepted Answer · 2013-10-15T09:22:20+0000

You can set the Unicode flag inline with (?u), as follows:

 re.search(r'(?u)\by\b', 'üyü')

To get familiar with the inline flags, experiment with these: (?iLmsux)
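
As a small experiment along those lines, the following Python 3 sketch compares an inline flag with the equivalent flags argument and then combines two flags in one group. The helper same_result and the sample strings are hypothetical. The L flag is left out because Python 3 accepts it only with bytes patterns.

```python
import re

def same_result(inline_pattern, plain_pattern, flag, text):
    """Return True if the inline flag and the flags argument agree on text."""
    inline_hit = re.search(inline_pattern, text) is not None
    flag_hit = re.search(plain_pattern, text, flag) is not None
    return inline_hit == flag_hit

# (?u) in the pattern behaves like passing re.UNICODE.
print(same_result(r"(?u)\by\b", r"\by\b", re.UNICODE, "üyü"))
print(same_result(r"(?u)\by\b", r"\by\b", re.UNICODE, "x y z"))

# Several flags can share one group: i = ignore case, u = Unicode.
combined = re.compile(r"(?iu)\bünlü\b")
standalone = combined.search("Çok ÜNLÜ biri") is not None
inside_word = combined.search("ünlüler") is not None
print(standalone, inside_word)
```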

As a good reference, read Core Python Applications Programming, 3rd Edition. It has a good chapter on regular expressions.
