
Regular expressions and Unicode (UTF-8) in Python?

I have this block of Django code:

    list_temp = []
    tagname_re = re.compile(r'^[\w+\.-]+$', re.UNICODE)
    for key, tag in list.items():
        if len(tag) > settings.FORM_MAX_LENGTH_OF_TAG or len(tag) < settings.FORM_MIN_LENGTH_OF_TAG:
            raise forms.ValidationError(_('please use between %(min)s and %(max)s characters in you tags') % {
                'min': settings.FORM_MIN_LENGTH_OF_TAG,
                'max': settings.FORM_MAX_LENGTH_OF_TAG})
        if not tagname_re.match(tag):
            raise forms.ValidationError(_('please use following characters in tags: letters , numbers, and characters \'.-_\''))
        # only keep one same tag
        if tag not in list_temp and len(tag.strip()) > 0:
            list_temp.append(tag)

This lets me use Unicode characters in tag names.

But it does not work with my script (Khmer Unicode; Khmer Symbols range: 19E0-19FF, Unicode Standard, version 4.0), and I don't know why.

My question is:

How can I change the pattern above, tagname_re = re.compile(r'^[\w+\.-]+$', re.UNICODE), so that it accepts my Unicode characters? If I enter the tag "នយោបាយ", I get this message:

please use following characters in tags: letters , numbers, and characters '.-_'

+4
3 answers

ោ (U+17C4 KHMER VOWEL SIGN OO) and ា (U+17B6 KHMER VOWEL SIGN AA) are not letters; they are combining marks, so they do not match \w.
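To see this for yourself, here is a small Python 3 sketch using only the standard library. It lists the Unicode general category of each code point in the tag from the question and collects the ones that the standard `re` module's `\w` does not accept. The variable names are only illustrative.

```python
import re
import unicodedata

# The Khmer tag from the question, written as escapes to avoid encoding trouble.
tag = "\u1793\u1799\u17c4\u1794\u17b6\u1799"

# General category per code point: "Lo" is a letter, "Mc" is a spacing combining mark.
categories = [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in tag]

# Code points that the standard library's \w does not match.
rejected = [f"U+{ord(ch):04X}" for ch in tag if re.fullmatch(r"\w", ch) is None]

print(categories)
print(rejected)
```

The two rejected code points are the vowel signs named above, which is why the whole tag fails the `^[\w+\.-]+$` check.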

+5

Check out the new regex implementation on PyPI:

 http://pypi.python.org/pypi/regex 

Python 3 says:

    >>> import regex
    >>> regex.match("\w", "\u17C4")
    <_regex.Match object at 0x00F03988>
    >>> regex.match("\w", "\u17B6")
    <_regex.Match object at 0x00F03D08>
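If installing a third-party module is not an option, one possible workaround is to skip `\w` and check Unicode general categories directly. The following is only a sketch in Python 3: the helper name `is_valid_tag` and the `EXTRA_CHARS` set are made up for illustration, and treating every mark (category M) as acceptable is an assumption that is broader than what `\w` allows.

```python
import unicodedata

# Punctuation named in the original error message (assumed to be the intended set).
EXTRA_CHARS = set("._-")


def is_valid_tag(tag: str) -> bool:
    """Return True if every character is a letter, mark, number, or in EXTRA_CHARS."""
    if not tag:
        return False
    for ch in tag:
        if ch in EXTRA_CHARS:
            continue
        # First letter of the general category: L = letter, M = mark, N = number.
        if unicodedata.category(ch)[0] not in "LMN":
            return False
    return True


print(is_valid_tag("\u1793\u1799\u17c4\u1794\u17b6\u1799"))  # the Khmer tag from the question
print(is_valid_tag("django-1.2"))
print(is_valid_tag("two words"))
```

In the Django code this check would take the place of `tagname_re.match(tag)`; the length checks stay as they are.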
+4

bobince's answer is definitely correct. However, before you even hit that problem, there may be another one: is tag definitely a unicode, not a str? For instance:

    >>> str_version = 'នយោបាយ'
    >>> type(str_version)
    <type 'str'>
    >>> print str_version
    នយោបាយ
    >>> unicode_version = 'នយោបាយ'.decode('utf-8')
    >>> type(unicode_version)
    <type 'unicode'>
    >>> print unicode_version
    នយោបាយ
    >>> r = re.compile(r'^(\w+)',re.U)
    >>> r.search(str_version).group(1)
    '\xe1'
    >>> print r.search(str_version).group(1)
    >>> r.search(unicode_version).group(1)
    u'\u1793\u1799'
    >>> print r.search(unicode_version).group(1)
    នយ
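The session above is Python 2. As a rough Python 3 counterpart, the sketch below makes the same point with `bytes` versus `str`: a bytes pattern only treats ASCII characters as `\w`, so the raw UTF-8 bytes do not match at all, while the decoded text matches up to the first combining mark. The variable names are illustrative.

```python
import re

# Raw UTF-8 bytes, as they might arrive before decoding, and the decoded text.
raw = "\u1793\u1799\u17c4\u1794\u17b6\u1799".encode("utf-8")
text = raw.decode("utf-8")

text_pattern = re.compile(r"^(\w+)")    # str pattern: \w is Unicode-aware
bytes_pattern = re.compile(rb"^(\w+)")  # bytes pattern: \w is ASCII-only

text_match = text_pattern.search(text)
bytes_match = bytes_pattern.search(raw)

# Escaped form of the matched prefix of the decoded text.
print(text_match.group(1).encode("unicode_escape"))
# No match on the undecoded bytes.
print(bytes_match)
```

Either way, the input has to be decoded text before the question of combining marks even comes up.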

One more small point about your regular expression: a + inside a character class is just a literal +, so your pattern also allows + in the sequence of letters and punctuation.
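A quick way to check this in Python 3 is to compare the pattern from the question with the same character class minus the `+`. The sample strings here are just examples.

```python
import re

original = re.compile(r"^[\w+\.-]+$")    # pattern from the question
without_plus = re.compile(r"^[\w.-]+$")  # same class with the literal + removed

# Inside [...], + is an ordinary character, so "c++" passes the original pattern.
plus_allowed = bool(original.match("c++"))
plus_rejected = not without_plus.match("c++")
still_ok = bool(without_plus.match("c-1.0"))

print(plus_allowed, plus_rejected, still_ok)
```

Note that the dot does not need escaping inside a character class, and a hyphen placed last is taken literally.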

+3

Source: https://habr.com/ru/post/1345316/

