Token pattern for n-grams in TfidfVectorizer in Python

Does TfidfVectorizer detect n-grams with Python regular expressions?

This question came up while reading the documentation for scikit-learn's TfidfVectorizer. There I see that the pattern used to recognize n-grams at the word level is token_pattern=u'(?u)\b\w\w+\b'. I have trouble following how this works. Consider the bigram case. If I do this:

  In [1]: import re

  In [2]: re.findall(u'(?u)\b\w\w+\b', u'this is a sentence! this is another one.')
  Out[2]: []

I do not find any bigrams, whereas:

  In [3]: re.findall(u'(?u)\w+ \w*', u'this is a sentence! this is another one.')
  Out[3]: [u'this is', u'a sentence', u'this is', u'another one']

finds some (but not all; for example, u'is a' and every other even-positioned bigram is missing). What am I misunderstanding about how the \b assertion works?
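
(A side note: re.findall returns only non-overlapping matches, and each match consumes its characters, which would explain the skipped pairs. As an illustration of my own, a zero-width lookahead with a capturing group does recover all consecutive word pairs:)

  import re

  text = 'this is a sentence! this is another one.'

  # findall consumes each match, so 'this is' swallows 'is' and the next
  # match can only start at 'a'. A zero-width lookahead matches at every
  # word start without consuming anything, so overlapping bigrams are
  # all reported (Python 3 output shown).
  print(re.findall(r'\b(?=(\w+ \w+))', text))
  # ['this is', 'is a', 'a sentence', 'this is', 'is another', 'another one']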

Note: the documentation for the regular expression module says the following about \b:

\b — Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
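
That definition is easy to check directly (a small illustration of my own, using a raw string so the backslash reaches re intact):

  import re

  # \b is zero-width: it asserts a word edge without consuming anything,
  # so it matches the standalone word 'cat' but not 'cat' inside a
  # longer word.
  print(re.findall(r'\bcat\b', 'cat catalog concat cat.'))
  # ['cat', 'cat']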

I have seen questions about identifying n-grams in Python (see 1, 2), so a secondary question is: should I build the combined n-grams myself and add them before feeding my text to TfidfVectorizer?
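
For reference, scikit-learn builds the n-grams itself from the individual tokens via the ngram_range parameter, so token_pattern only has to match single words. A minimal sketch (get_feature_names_out assumes scikit-learn >= 1.0; older versions use get_feature_names):

  from sklearn.feature_extraction.text import TfidfVectorizer

  docs = ['this is a sentence! this is another one.']

  # token_pattern matches single tokens; ngram_range=(1, 2) tells the
  # vectorizer to also join consecutive tokens into bigram features.
  vec = TfidfVectorizer(ngram_range=(1, 2))
  vec.fit(docs)
  print(vec.get_feature_names_out())
  # Note: the default token_pattern requires two or more word characters,
  # so the single-letter token 'a' is dropped before bigrams are formed.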

+6

1 answer

You need to prefix the regular expression with r so that it is a raw string literal. The following works:

  >>> re.findall(r'(?u)\b\w\w+\b', u'this is a sentence! this is another one.')
  [u'this', u'is', u'sentence', u'this', u'is', u'another', u'one']

This is a known bug in the documentation, but if you look at the source code, they use raw literals.
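
To see why the non-raw version fails: in a plain string literal, Python converts \b to the backspace character before re ever sees it. A quick demonstration (my own, written out with explicit escapes):

  import re

  text = 'this is a sentence! this is another one.'

  # In a non-raw literal, \b is the backspace character:
  print('\b' == '\x08')                            # True

  # So the original pattern really asked for backspace-delimited words,
  # which the text does not contain:
  print(re.findall('(?u)\x08\\w\\w+\x08', text))   # []

  # With a raw literal, \b survives as the word-boundary assertion:
  print(re.findall(r'(?u)\b\w\w+\b', text))
  # ['this', 'is', 'sentence', 'this', 'is', 'another', 'one']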

+1

Source: https://habr.com/ru/post/984293/

