CountVectorizer returns only zeros

Question

I am trying to extract features from a document, given a predefined set of features.

from sklearn.feature_extraction.text import CountVectorizer
features = ['a', 'b', 'c']
doc = ['a', 'c']

vectoriser = CountVectorizer()
vectoriser.vocabulary = features
vectoriser.fit_transform(doc)

However, the output is a 2x3 array filled with zeros, not:

desired_output = [[1, 0, 0],
                  [0, 0, 1]]

Any help would be greatly appreciated.

python scikit-learn

Immortalz Mar 6 '17 at 20:05

1 answer

Kewl · Accepted Answer · 2017-03-06T20:23:02+0000

This is because the default token pattern (the token_pattern parameter) in CountVectorizer discards any token that is only one character long, so 'a' and 'c' are never counted. You can override the token pattern to fix this:

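from sklearn.feature_extraction.text import CountVectorizer
features = ['a', 'b', 'c']
doc = ['a', 'c']

# Pass the fixed vocabulary to the constructor and allow one-character tokens
vectoriser = CountVectorizer(vocabulary=features, token_pattern=r"\b\w+\b")

# fit_transform returns a sparse matrix; toarray() makes the counts visible
print(vectoriser.fit_transform(doc).toarray())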
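
To see why the pattern matters, the tokenisation step can be roughly reproduced with the re module alone. This is only a sketch: it assumes the default token_pattern is r"(?u)\b\w\w+\b", as given in the scikit-learn documentation, and the names DEFAULT_PATTERN and SINGLE_CHAR_PATTERN are illustrative, not part of scikit-learn.

```python
import re

# Assumed default token_pattern of CountVectorizer: two or more word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern from the answer above: one or more word characters
SINGLE_CHAR_PATTERN = r"\b\w+\b"

doc = ['a', 'c']

# Tokens each pattern extracts from every document
default_tokens = [re.findall(DEFAULT_PATTERN, text) for text in doc]
custom_tokens = [re.findall(SINGLE_CHAR_PATTERN, text) for text in doc]

print(default_tokens)  # [[], []]
print(custom_tokens)   # [['a'], ['c']]
```

With the default pattern no tokens survive, so every count is zero regardless of the vocabulary; with the relaxed pattern each document yields its single-character token.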
