Your code:
print re.findall(r'[\u0600-\u06FF]+',my_string)
When matching byte sequences, there is no such thing as a Unicode code point. The \u escape sequences in the regex therefore make no sense: they are not interpreted the way you expected, but simply stand for a literal u.
So, when the regular expression is parsed for bytes, your pattern is equivalent to:
print re.findall(r'[u0600-u06FF]+',my_string)
This character class is interpreted as "one of u060, or a byte in the range 0-u, or one of 06FF". This, in turn, is equivalent to [0-u], since all the other bytes already fall inside that range.
print re.findall(r'[0-u]+', my_string)
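The equivalence can be checked directly. Here is a minimal sketch that lists which ASCII characters each character class accepts and compares the two sets. The helper name char_class_members is made up for this illustration and is not part of the re module. It is written to run unchanged on Python 2.7 and Python 3.

```python
import re

def char_class_members(pattern):
    # Hypothetical helper: collect every ASCII character that the
    # given character class matches on its own.
    compiled = re.compile(pattern)
    return {chr(i) for i in range(128) if compiled.match(chr(i))}

# The class as the byte regex engine effectively sees it ...
literal_class = char_class_members(r'[u0600-u06FF]')
# ... and the plain range it collapses to.
range_class = char_class_members(r'[0-u]')

print(literal_class == range_class)
```

Both sets cover the characters from 0 (0x30) through u (0x75), which is why digits, uppercase letters, some punctuation and the lowercase letters up to u all match.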
Demonstration:
import re
my_string = "What is thizz?"
print re.findall(r'[\u0600-\u06FF]+',my_string)
['What', 'is', 'thi', '?']
Note that zz is not matched, because z comes after u in the ASCII character set.
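For contrast, here is a sketch of the same search done on text rather than bytes. It assumes the goal is to find runs of characters in the range U+0600 to U+06FF, and the sample text is made up for illustration. With a Unicode pattern and a Unicode subject string, the \u escapes are resolved to code points by the string literal, so the range means what was intended. The u'' prefix keeps it valid on both Python 2.7 and Python 3.3+.

```python
import re

# Unicode subject: ASCII words around one word written with
# code points from the U+0600-U+06FF block.
text = u'abc \u0633\u0644\u0627\u0645 xyz'

# Non-raw Unicode literal: Python turns \u0600 and \u06FF into the
# actual characters before the regex engine ever sees the pattern.
pattern = re.compile(u'[\u0600-\u06FF]+')

matches = pattern.findall(text)
print(len(matches))
```

Only the middle word is returned. The ASCII letters on either side fall outside the range and are skipped.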