CountVectorizer ignores "I"

Why does the Sklearn CountVectorizer ignore the pronoun "I"?

    ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2), min_df=1)
    ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
    <1x3 sparse matrix of type '<class 'numpy.int64'>'
    ngram_vectorizer.get_feature_names()
    ['gave it', 'he gave', 'it to']
1 answer

By default, the tokenizer only keeps tokens of two or more word characters, so the single-letter word "I" is dropped before the bigrams are built.

You can change this behavior by passing a suitable token_pattern to your CountVectorizer.

The default pattern (see the signature in the documentation):

 'token_pattern': u'(?u)\\b\\w\\w+\\b' 
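To see why "I" disappears, the pattern can be tried directly with Python's re module. This is only a sketch: it assumes the vectorizer lowercases the text (its default) and extracts tokens with a plain regex search, and the variable names are made up for illustration.

```python
import re

# Default token pattern quoted above: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern: one or more word characters.
RELAXED_PATTERN = r"(?u)\b\w+\b"

# CountVectorizer lowercases input by default, so mimic that here.
text = "HE GAVE IT TO I".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
relaxed_tokens = re.findall(RELAXED_PATTERN, text)

print(default_tokens)  # ['he', 'gave', 'it', 'to']
print(relaxed_tokens)  # ['he', 'gave', 'it', 'to', 'i']
```

With the default pattern the "i" never becomes a token, so no bigram can contain it.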

You can get a CountVectorizer that does not drop single-letter words by overriding the default, for example:

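    from sklearn.feature_extraction.text import CountVectorizer

    # token_pattern now accepts tokens of one or more word characters
    ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2),
                                       token_pattern=u"(?u)\\b\\w+\\b", min_df=1)
    ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
    # Note: newer scikit-learn releases replace get_feature_names()
    # with get_feature_names_out()
    print(ngram_vectorizer.get_feature_names())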

This gives:

 ['gave it', 'he gave', 'it to', 'to i'] 
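The bigram lists can be reproduced without scikit-learn, which makes the effect of the token pattern easier to follow. The helper below, word_bigrams, is a hypothetical name for illustration. It assumes lowercasing, regex tokenization, and alphabetically sorted feature names, and is a rough approximation rather than the library's actual implementation.

```python
import re


def word_bigrams(text, token_pattern):
    """Return the sorted, de-duplicated word bigrams of `text`."""
    tokens = re.findall(token_pattern, text.lower())
    # Pair each token with its successor to form the bigrams.
    pairs = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return sorted(pairs)


sentence = "HE GAVE IT TO I"

default_bigrams = word_bigrams(sentence, r"(?u)\b\w\w+\b")
relaxed_bigrams = word_bigrams(sentence, r"(?u)\b\w+\b")

print(default_bigrams)  # ['gave it', 'he gave', 'it to']
print(relaxed_bigrams)  # ['gave it', 'he gave', 'it to', 'to i']
```

The only difference between the two results is the trailing "to i" bigram, which matches the outputs shown in the question and in this answer.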

Source: https://habr.com/ru/post/1234196/

