I am working on a text classification problem that I set up like this (I omitted the data loading and preprocessing steps, but they produce a dataframe called data with columns X and y):
import sklearn.model_selection as ms
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

sim = Pipeline([
    ("vec", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("rdf", RandomForestClassifier()),
])
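(For context, a hypothetical stand-in for data would be something like the following; my real preprocessing is different, but X holds the raw document strings and y the class labels.)

import pandas as pd

# Hypothetical placeholder only -- the real dataframe comes from my
# preprocessing steps; X is raw text, y is the class label.
data = pd.DataFrame({
    "X": ["first document ...", "second document ...", "third document ..."],
    "y": ["label_a", "label_b", "label_a"],
})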
Now I want to evaluate this model by training it on 2/3 of the data and testing it on the remaining 1/3, for example:
train, test = ms.train_test_split(data, test_size=0.33)
sim.fit(train.X, train.y)
sim.score(test.X, test.y)
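For clarity, repeating this hold-out evaluation three times would look roughly like this (a sketch; each train_test_split call draws a fresh random split):

scores = []
for _ in range(3):
    train, test = ms.train_test_split(data, test_size=0.33)
    sim.fit(train.X, train.y)
    scores.append(sim.score(test.X, test.y))
print(scores)  # three hold-out accuracies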
That gives me three scores from three different random test sets, which is what I want. But using cross_val_score instead gives me results that are much lower:
ms.cross_val_score(sim, data.X, data.y)
# [ 0.29264069 0.36729223 0.22977941]
As far as I understand, each of the scores in this array should be produced by training on 2/3 of the data and scoring the remaining 1/3 with the sim.score method. So why are they all so much lower?
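My mental model of cross_val_score is roughly the following manual loop (a sketch; I am assuming a plain, unshuffled 3-fold split here, whereas sklearn's actual default for classifiers is a stratified, unshuffled splitter):

from sklearn.model_selection import KFold

# Sketch of what I think cross_val_score does: plain 3-fold splits,
# no shuffling. (sklearn's real default for classifiers is stratified.)
fold_scores = []
for train_idx, test_idx in KFold(n_splits=3).split(data):
    fold_train, fold_test = data.iloc[train_idx], data.iloc[test_idx]
    sim.fit(fold_train.X, fold_train.y)
    fold_scores.append(sim.score(fold_test.X, fold_test.y))
print(fold_scores)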