Using counts and tfidf as features with scikit-learn

I am trying to use both counts and tfidf as features for a multinomial NB model. Here is my code:

    text = ["this is spam", "this isn't spam"]
    labels = [0,1]
    count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
    tf_transformer = TfidfTransformer(use_idf=True)
    combined_features = FeatureUnion([("counts", self.count_vectorizer), ("tfidf", tf_transformer)]).fit(self.text)
    classifier = MultinomialNB()
    classifier.fit(combined_features, labels)

But I get an error with FeatureUnion and tfidf:

 TypeError: no supported conversion for types: (dtype('S18413'),) 

Any idea why this might happen? Is it impossible to have both counts and tfidf as features?

1 answer

The error does not come from FeatureUnion; it comes from TfidfTransformer.

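You should use TfidfVectorizer instead of TfidfTransformer. FeatureUnion passes the same raw input to each of its transformers, and TfidfTransformer expects a numeric count matrix (a NumPy array or sparse matrix) as input, not plain text, hence the TypeError.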
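To illustrate the difference, here is a minimal sketch (the four sample documents are made up for the illustration). It feeds TfidfTransformer the count matrix it expects and checks that the result matches what TfidfVectorizer produces directly from the raw text:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical sample documents, only for illustration.
docs = ["this is spam", "this isn't spam", "free money now", "meeting at noon"]

# TfidfTransformer works on a matrix of term counts, not on strings.
counts = CountVectorizer().fit_transform(docs)
tfidf_from_counts = TfidfTransformer(use_idf=True).fit_transform(counts)

# TfidfVectorizer does both steps, so it accepts the raw strings.
tfidf_from_text = TfidfVectorizer(use_idf=True).fit_transform(docs)

same = np.allclose(tfidf_from_counts.toarray(), tfidf_from_text.toarray())
print("Same tf-idf matrix:", same)
```

Inside a FeatureUnion every transformer receives the raw strings, which is why the vectorizer variant is the one that fits there.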

Also, the test sample is too small for testing tf-idf, so try using a larger one. Here is an example:

    from nltk.corpus import brown
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.pipeline import FeatureUnion
    from sklearn.naive_bayes import MultinomialNB

    # Let's get more text from NLTK
    text = [" ".join(i) for i in brown.sents()[:100]]

    # I'm just gonna assign random tags.
    labels = ['yes']*50 + ['no']*50

    count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
    tf_transformer = TfidfVectorizer(use_idf=True)
    combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tf_transformer)]).fit_transform(text)

    classifier = MultinomialNB()
    classifier.fit(combined_features, labels)
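If you also want to predict on new text, one option is to wrap the FeatureUnion and the classifier in a Pipeline, so the fitted vectorizers are reused at prediction time. This is only a sketch under assumptions: the tiny corpus, the "spam"/"ham" labels and the step names are invented for the example, and min_df is left at its default because the corpus is so small.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Hypothetical toy corpus; replace with real data.
train_text = [
    "win free money now",
    "free prize claim now",
    "cheap pills free offer",
    "meeting at noon tomorrow",
    "lunch with the team",
    "project report due tomorrow",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Both vectorizers receive the raw strings; their outputs are stacked
# side by side into one sparse feature matrix.
features = FeatureUnion([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfVectorizer(use_idf=True)),
])

model = Pipeline([("features", features), ("nb", MultinomialNB())])
model.fit(train_text, train_labels)

# The combined matrix has one block of count columns and one block of tf-idf columns.
X = model.named_steps["features"].transform(train_text)
print(X.shape)

prediction = model.predict(["free money offer"])
print(prediction)
```

The pipeline keeps the feature extraction and the classifier together, so new documents go through exactly the same vocabulary as the training data.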

Source: https://habr.com/ru/post/979016/

