Token pattern for n-grams in TfidfVectorizer in Python

Does TfidfVectorizer detect n-grams with Python regular expressions?

This question came up while reading the documentation for scikit-learn's TfidfVectorizer. There I see that the pattern used to recognize n-grams at the word level is token_pattern=u'(?u)\b\w\w+\b'. I have trouble following how this works. Consider the bigram case. If I do this:

  In [1]: import re

  In [2]: re.findall(u'(?u)\b\w\w+\b', u'this is a sentence! this is another one.')
  Out[2]: []

I do not find any bigrams, whereas:

  In [3]: re.findall(u'(?u)\w+ \w*', u'this is a sentence! this is another one.')
  Out[3]: [u'this is', u'a sentence', u'this is', u'another one']

finds some (but not all; for example, u'is a' and every other even-positioned bigram is missing). What am I misunderstanding about how the \b assertion works?
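
(A side note: re.findall returns only non-overlapping matches, and each match consumes its characters, which would explain the skipped pairs. As an illustration of my own, a zero-width lookahead with a capturing group does recover all consecutive word pairs:)

  import re

  text = 'this is a sentence! this is another one.'

  # findall consumes each match, so 'this is' swallows 'is' and the next
  # match can only start at 'a'. A zero-width lookahead matches at every
  # word start without consuming anything, so overlapping bigrams are
  # all reported (Python 3 output shown).
  print(re.findall(r'\b(?=(\w+ \w+))', text))
  # ['this is', 'is a', 'a sentence', 'this is', 'is another', 'another one']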

Note: the documentation for the regular expression module says the following about \b:

\b — Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
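
That definition is easy to check directly (a small illustration of my own, using a raw string so the backslash reaches re intact):

  import re

  # \b is zero-width: it asserts a word edge without consuming anything,
  # so it matches the standalone word 'cat' but not 'cat' inside a
  # longer word.
  print(re.findall(r'\bcat\b', 'cat catalog concat cat.'))
  # ['cat', 'cat']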

I have seen questions about identifying n-grams in Python (see 1, 2), so a secondary question is: should I build the combined n-grams myself and add them before feeding my text to TfidfVectorizer?
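
For reference, scikit-learn builds the n-grams itself from the individual tokens via the ngram_range parameter, so token_pattern only has to match single words. A minimal sketch (get_feature_names_out assumes scikit-learn >= 1.0; older versions use get_feature_names):

  from sklearn.feature_extraction.text import TfidfVectorizer

  docs = ['this is a sentence! this is another one.']

  # token_pattern matches single tokens; ngram_range=(1, 2) tells the
  # vectorizer to also join consecutive tokens into bigram features.
  vec = TfidfVectorizer(ngram_range=(1, 2))
  vec.fit(docs)
  print(vec.get_feature_names_out())
  # Note: the default token_pattern requires two or more word characters,
  # so the single-letter token 'a' is dropped before bigrams are formed.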

+6

1 answer

You need to prefix the regular expression with r so that it is a raw string literal. The following works:

  >>> re.findall(r'(?u)\b\w\w+\b', u'this is a sentence! this is another one.')
  [u'this', u'is', u'sentence', u'this', u'is', u'another', u'one']

This is a known bug in the documentation, but if you look at the source code, they use raw literals.
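
To see why the non-raw version fails: in a plain string literal, Python converts \b to the backspace character before re ever sees it. A quick demonstration (my own, written out with explicit escapes):

  import re

  text = 'this is a sentence! this is another one.'

  # In a non-raw literal, \b is the backspace character:
  print('\b' == '\x08')                            # True

  # So the original pattern really asked for backspace-delimited words,
  # which the text does not contain:
  print(re.findall('(?u)\x08\\w\\w+\x08', text))   # []

  # With a raw literal, \b survives as the word-boundary assertion:
  print(re.findall(r'(?u)\b\w\w+\b', text))
  # ['this', 'is', 'sentence', 'this', 'is', 'another', 'one']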

+1

Source: https://habr.com/ru/post/984293/

