CountVectorizer returns only zeros

I am trying to extract features from a document, given a fixed set of features.

from sklearn.feature_extraction.text import CountVectorizer
features = ['a', 'b', 'c']
doc = ['a', 'c']

vectoriser = CountVectorizer()
vectoriser.vocabulary = features
vectoriser.fit_transform(doc)

However, the output is a 2x3 array filled with zeros, not:

desired_output = [[1, 0, 0],
                  [0, 0, 1]]

Any help would be greatly appreciated.

1 answer

This happens because the default token pattern in CountVectorizer (token_pattern=r"(?u)\b\w\w+\b") only matches tokens of two or more word characters, so any word that is just one character long is discarded. You can override the token pattern to fix this:

from sklearn.feature_extraction.text import CountVectorizer
features = ['a', 'b', 'c']
doc = ['a', 'c']

vectoriser = CountVectorizer(vocabulary=features, token_pattern=r"\b\w+\b")

vectoriser.fit_transform(doc)
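
To see the difference for yourself, here is a rough sketch that runs both patterns side by side. The variable names (single_char_pattern, default_counts, fixed_counts) are mine and not part of scikit-learn, and it assumes a reasonably recent scikit-learn where the default token_pattern is still the two-or-more-character one.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

features = ['a', 'b', 'c']
doc = ['a', 'c']

# Read the default pattern from an unconfigured vectoriser.
default_pattern = CountVectorizer().token_pattern
# Illustrative pattern that also accepts one-character tokens.
single_char_pattern = r"\b\w+\b"

# Tokenise a sample string with each pattern to compare what survives.
print(re.findall(default_pattern, "a c"))
print(re.findall(single_char_pattern, "a c"))

# Default pattern: no tokens are produced, so every count stays zero.
default_counts = CountVectorizer(vocabulary=features).fit_transform(doc).toarray()

# Relaxed pattern: 'a' and 'c' are counted against the fixed vocabulary.
fixed_counts = CountVectorizer(
    vocabulary=features, token_pattern=single_char_pattern
).fit_transform(doc).toarray()

print(default_counts)
print(fixed_counts)
```

Note that fit_transform returns a sparse matrix, so .toarray() is only there to make the counts easy to read on a small example like this one.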

Source: https://habr.com/ru/post/1671634/

