Does TfidfVectorizer detect n-grams with Python regular expressions?
This came up while reading the documentation for scikit-learn's TfidfVectorizer: I see that the default pattern for recognizing word-level tokens is token_pattern=u'(?u)\b\w\w+\b', and I am having trouble understanding how this works. Consider the bigram case. If I do this:
In [1]: import re

In [2]: re.findall(u'(?u)\b\w\w+\b', u'this is a sentence! this is another one.')
Out[2]: []
I do not find any bigrams. By contrast, this:
In [2]: re.findall(u'(?u)\w+ \w*', u'this is a sentence! this is another one.')
Out[2]: [u'this is', u'a sentence', u'this is', u'another one']
finds some (but not all: u'is a', for example, and every other overlapping bigram is missing). What am I doing wrong in my interpretation of the \b character?
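One thing I noticed while experimenting (I am not sure this is the issue, so treat it as a minimal check rather than an answer): in a non-raw Python string literal, \b is the backspace escape, so the pattern as written above contains a literal backspace rather than a regex word boundary. With a raw string the same pattern does match:

In [3]: u'\b' == u'\x08'   # in a non-raw literal, \b is the backspace character
Out[3]: True

In [4]: re.findall(r'(?u)\b\w\w+\b', u'this is a sentence! this is another one.')
Out[4]: [u'this', u'is', u'sentence', u'this', u'is', u'another', u'one']

Even then, though, the matches are single tokens, not bigrams.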
Note: according to the documentation for the re module, \b is supposed to behave as follows:
\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
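As a quick sanity check of that description (a small sketch, using a raw string so that \b reaches the regex engine as a word boundary rather than a backspace), \b does match at the transition next to punctuation:

In [5]: re.search(r'\bsentence\b', u'this is a sentence!') is not None
Out[5]: True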
I have seen questions about computing n-grams in Python (see 1, 2), so a secondary question is: should I build the joined n-grams myself and add them before feeding my text to TfidfVectorizer?
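For context on that secondary question, here is a minimal sketch of my understanding of the ngram_range parameter (reusing the example sentence from above): token_pattern only extracts single tokens, and the vectorizer then joins consecutive tokens into n-gram features itself:

In [6]: from sklearn.feature_extraction.text import TfidfVectorizer

In [7]: vec = TfidfVectorizer(ngram_range=(1, 2))  # request unigrams and bigrams

In [8]: vec.fit([u'this is a sentence! this is another one.']).get_feature_names()
Out[8]: [u'another', u'another one', u'is', u'is another', u'is sentence',
         u'one', u'sentence', u'sentence this', u'this', u'this is']

If that is right, the bigrams are assembled from the token stream rather than matched directly by the regex (note that u'a' is dropped because the default token pattern requires at least two word characters).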