By default, the tokenizer only considers tokens of two or more word characters.
You can change this behavior by passing a different token_pattern to your CountVectorizer.
Default pattern (see the signature in the documentation):
'token_pattern': u'(?u)\\b\\w\\w+\\b'
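For comparison, here is a minimal sketch (not part of the original answer, assuming scikit-learn is installed) of what the default pattern does with the same sentence; note how the single-letter 'I' is dropped:

from sklearn.feature_extraction.text import CountVectorizer

# Default token_pattern requires at least two word characters per token
default_vectorizer = CountVectorizer()
default_vectorizer.fit_transform(['HE GAVE IT TO I'])
# In scikit-learn >= 1.2, use get_feature_names_out() instead
print(default_vectorizer.get_feature_names())
# ['gave', 'he', 'it', 'to']  -- the single-letter 'i' is missing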
You can get a CountVectorizer that does not discard single-letter words by changing the default value, for example:
from sklearn.feature_extraction.text import CountVectorizer

# Relaxed token_pattern: any run of word characters, including single letters
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2),
                                   token_pattern=u"(?u)\\b\\w+\\b", min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
# In scikit-learn >= 1.2, use get_feature_names_out() instead
print(ngram_vectorizer.get_feature_names())
Which gives:
['gave it', 'he gave', 'it to', 'to i']
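As a quick check (again, an illustrative sketch rather than part of the original answer), switching the same vectorizer to unigrams shows that the single-letter token 'i' is now kept:

from sklearn.feature_extraction.text import CountVectorizer

# Same relaxed pattern, but unigrams instead of bigrams
unigram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 1),
                                     token_pattern=u"(?u)\\b\\w+\\b", min_df=1)
unigram_vectorizer.fit_transform(['HE GAVE IT TO I'])
print(unigram_vectorizer.get_feature_names())
# ['gave', 'he', 'i', 'it', 'to']  -- 'i' survives with the relaxed pattern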