By default, the tokenizer only considers tokens of two or more word characters.
You can change this behavior by passing a different token_pattern to your CountVectorizer.
Default pattern (see the signature in the documentation):
'token_pattern': u'(?u)\\b\\w\\w+\\b'
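For comparison, here is a minimal sketch (not part of the original answer, assuming scikit-learn is installed) of what the default pattern does with the same sentence; note how the single-letter 'I' is dropped:

from sklearn.feature_extraction.text import CountVectorizer

# Default token_pattern requires at least two word characters per token
default_vectorizer = CountVectorizer()
default_vectorizer.fit_transform(['HE GAVE IT TO I'])
# In scikit-learn >= 1.2, use get_feature_names_out() instead
print(default_vectorizer.get_feature_names())
# ['gave', 'he', 'it', 'to']  -- the single-letter 'i' is missing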
You can get a CountVectorizer that does not discard single-letter words by changing the default value, for example:
from sklearn.feature_extraction.text import CountVectorizer

# Relaxed token_pattern: any run of word characters, including single letters
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2),
                                   token_pattern=u"(?u)\\b\\w+\\b", min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
# In scikit-learn >= 1.2, use get_feature_names_out() instead
print(ngram_vectorizer.get_feature_names())
Which gives:
['gave it', 'he gave', 'it to', 'to i']
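As a quick check (again, an illustrative sketch rather than part of the original answer), switching the same vectorizer to unigrams shows that the single-letter token 'i' is now kept:

from sklearn.feature_extraction.text import CountVectorizer

# Same relaxed pattern, but unigrams instead of bigrams
unigram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 1),
                                     token_pattern=u"(?u)\\b\\w+\\b", min_df=1)
unigram_vectorizer.fit_transform(['HE GAVE IT TO I'])
print(unigram_vectorizer.get_feature_names())
# ['gave', 'he', 'i', 'it', 'to']  -- 'i' survives with the relaxed pattern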