What does "(?u)" in a regular expression do?

I looked at how tokenization is implemented in scikit-learn and found this regex (source):

token_pattern = r"(?u)\b\w\w+\b" 

The regex is pretty simple, but I've never seen the (?u) part before. Can someone explain to me what this part does?

1 answer

It enables the re.U (re.UNICODE) flag for this expression.
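As a rough illustration, here is a minimal sketch assuming Python 3, where str patterns are already Unicode-aware by default, so (?u) mostly matters for Python 2 compatibility. To make the effect visible, the sketch contrasts the pattern from the question with an ASCII-only variant using the (?a) inline flag. The sample text is made up for this example.

```python
import re

# Hypothetical sample text: "naïve café test" (written with escapes for portability).
text = "na\u00efve caf\u00e9 test"

# Unicode-aware matching: \w also matches accented letters.
unicode_tokens = re.findall(r"(?u)\b\w\w+\b", text)

# ASCII-only matching, for contrast: \w is limited to [a-zA-Z0-9_],
# so accented letters act as token separators.
ascii_tokens = re.findall(r"(?a)\b\w\w+\b", text)

print(unicode_tokens)
print(ascii_tokens)
```

With Unicode matching the accented words stay whole; with ASCII matching they are split at the accented characters.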

From the documentation:

(?iLmsux)

(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in the section "Module Contents".) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.
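The equivalence described in the documentation can be sketched as follows, using the re.I flag because its effect is easy to see. The sample text and variable names are made up for this example. Note that recent Python versions (3.11 and later) only accept such global inline flags at the very start of the pattern.

```python
import re

text = "Hello WORLD hello"  # hypothetical sample text

# Flag embedded in the pattern itself.
inline = re.compile(r"(?i)hello")

# Same flag passed as an argument to re.compile().
with_arg = re.compile(r"hello", re.I)

inline_matches = inline.findall(text)
arg_matches = with_arg.findall(text)

print(inline_matches)
print(arg_matches)

# Both compiled patterns report the IGNORECASE flag as set.
print(bool(inline.flags & re.I), bool(with_arg.flags & re.I))
```

The inline form is handy when only a pattern string can be supplied, as with the token_pattern parameter from the question.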


Source: https://habr.com/ru/post/1241638/

