Regular expression and unicode utf-8 in python?

Question

I have this block of Django code:

    list_temp = []
    tagname_re = re.compile(r'^[\w+\.-]+$', re.UNICODE)
    for key, tag in list.items():
        if len(tag) > settings.FORM_MAX_LENGTH_OF_TAG or len(tag) < settings.FORM_MIN_LENGTH_OF_TAG:
            raise forms.ValidationError(_('please use between %(min)s and %(max)s characters in you tags') % {
                'min': settings.FORM_MIN_LENGTH_OF_TAG,
                'max': settings.FORM_MAX_LENGTH_OF_TAG})
        if not tagname_re.match(tag):
            raise forms.ValidationError(_('please use following characters in tags: letters , numbers, and characters \'.-_\''))
        # only keep one same tag
        if tag not in list_temp and len(tag.strip()) > 0:
            list_temp.append(tag)

This lets me use Unicode characters in tag names.

But I don't know why it fails with my own script, Khmer (Khmer Symbols range: 19E0-19FF, Unicode Standard version 4.0).

My question is:

How can I change tagname_re = re.compile(r'^[\w+\.-]+$', re.UNICODE) in the code above so that it accepts my Unicode characters? If I enter the tag "នយោបាយ", I receive this message:

please use following characters in tags: letters , numbers, and characters '.-_'

+4

python django regex unicode

kn3l Mar 26 '11 at 8:26


3 answers

Check out the new regex implementation on PyPI:

 http://pypi.python.org/pypi/regex

Python 3 says:

    >>> import regex
    >>> regex.match("\w", "\u17C4")
    <_regex.Match object at 0x00F03988>
    >>> regex.match("\w", "\u17B6")
    <_regex.Match object at 0x00F03D08>

+4

Mrab Apr 01 '11 at 23:35


bobince's answer is definitely correct. However, you may hit another problem before that one: is tag definitely a unicode object, not a str? For instance:

    >>> str_version = 'នយោបាយ'
    >>> type(str_version)
    <type 'str'>
    >>> print str_version
    នយោបាយ
    >>> unicode_version = 'នយោបាយ'.decode('utf-8')
    >>> type(unicode_version)
    <type 'unicode'>
    >>> print unicode_version
    នយោបាយ
    >>> r = re.compile(r'^(\w+)',re.U)
    >>> r.search(str_version).group(1)
    '\xe1'
    >>> print r.search(str_version).group(1)
    >>> r.search(unicode_version).group(1)
    u'\u1793\u1799'
    >>> print r.search(unicode_version).group(1)
    នយ
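The session above is Python 2. For anyone on Python 3, where the split is bytes versus str, the same check might look roughly like this. It is only a sketch: to_text is a hypothetical helper name, not something from Django or from the original code, and it assumes the incoming bytes are UTF-8.

```python
import re


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as str, decoding it if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Bytes, as a tag might arrive before decoding (assumed to be UTF-8).
raw = "នយោបាយ".encode("utf-8")
tag = to_text(raw)

# Match a leading run of word characters against the decoded text, not the bytes.
word_re = re.compile(r"^\w+")
first_run = word_re.search(tag).group(0)

print(len(raw), len(tag))  # byte count versus character count
print(first_run)
```

Matching against the decoded text is what makes \w see whole Khmer characters instead of individual bytes.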

As another small point about your regular expression: a + inside a character class is not a quantifier. It simply means that a literal + is also allowed in the sequence of letters and punctuation marks.
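A quick way to see the effect of that + is to compare the pattern with and without it. This is a small sketch in Python 3; the names with_plus and without_plus are made up for the comparison.

```python
import re

# '+' inside [...] is a literal plus sign, not a quantifier.
with_plus = re.compile(r"^[\w+.-]+$")
# The same character class with the '+' removed.
without_plus = re.compile(r"^[\w.-]+$")

plus_allowed = with_plus.match("c++") is not None
plus_rejected = without_plus.match("c++") is None

print(plus_allowed, plus_rejected)
```

If you do not want + in tag names, dropping it from the class should be enough.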

+3

Mark Longair Mar 26 '11 at 9:16


bobince · Accepted Answer · 2011-03-26T08:54:57+0000

ោ (U+17C4 KHMER VOWEL SIGN OO) and ា (U+17B6 KHMER VOWEL SIGN AA) are not letters, they are combining marks, so they do not match \w.
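One possible workaround, if you want to stay with the standard library, is to check each character's Unicode general category instead of relying on \w. The sketch below is Python 3 and only an illustration: is_valid_tag and EXTRA_CHARS are hypothetical names, and accepting every mark (categories starting with M) alongside letters and numbers is an assumption about what a tag should contain.

```python
import unicodedata

# Punctuation the original pattern allowed ('_' comes from \w).
EXTRA_CHARS = set("_+.-")


def is_valid_tag(tag):
    """Hypothetical validator: letters, marks, numbers, or EXTRA_CHARS only."""
    if not tag:
        return False
    for ch in tag:
        if ch in EXTRA_CHARS:
            continue
        # First letter of the general category: 'L' letter, 'M' mark, 'N' number.
        if unicodedata.category(ch)[0] not in "LMN":
            return False
    return True


# General categories of a Khmer letter and the two vowel signs discussed above.
categories = [unicodedata.category(ch) for ch in "\u1793\u17C4\u17B6"]
print(categories)
print(is_valid_tag("នយោបាយ"))
```

The two vowel signs report a mark category rather than a letter category, which is why the extra M check is needed for this tag to pass.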
