I looked at how tokenization is implemented in scikit-learn and found this regex (source):
token_pattern = r"(?u)\b\w\w+\b"
The regex is pretty simple, but I've never seen the (?u) part before. Can someone explain to me what this part does?
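For reference, here is a minimal sketch of the pattern applied with Python's standard `re` module. The `tokenize` helper and the sample sentence are made up for illustration and are not taken from scikit-learn. As far as I can tell, `(?u)` is the inline form of the `re.UNICODE` flag, so the sketch compares it against the same pattern with no flag and with `(?a)` (ASCII-only matching):

```python
import re

# The pattern from scikit-learn, copied from the question.
token_pattern = r"(?u)\b\w\w+\b"


def tokenize(text, pattern=token_pattern):
    """Hypothetical helper: return every match of `pattern` in `text`."""
    return re.findall(pattern, text)


# Made-up sample containing non-ASCII letters.
sample = "I ate a crème brûlée in Zürich"

# With the inline (?u) flag: \w and \b follow Unicode rules.
unicode_tokens = tokenize(sample)

# Same pattern without any inline flag.
no_flag_tokens = tokenize(sample, r"\b\w\w+\b")

# With the inline (?a) flag: \w only matches [a-zA-Z0-9_].
ascii_tokens = tokenize(sample, r"(?a)\b\w\w+\b")

print(unicode_tokens)
print(no_flag_tokens)
print(ascii_tokens)
```

On Python 3 the first two lists come out identical, because Unicode matching is already the default for `str` patterns; the flag made a difference on Python 2, where it was off by default. The ASCII-only variant splits words at accented letters. Single-character words such as "I" and "a" are dropped in every case, since the pattern requires at least two word characters.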
fwind