I am currently studying scikit-learn pipelines and want to pre-process my data with one. However, my train and test data have different levels of a categorical variable. For example, consider:
    import pandas as pd

    train = pd.Series(list('abbaa'))
    test = pd.Series(list('abcd'))
I wrote a TransformerMixin class using pandas:
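    from sklearn.base import TransformerMixin

    class CreateDummies(TransformerMixin):
        def transform(self, X, **transformparams):
            # One-hot encode whatever levels happen to be present in X
            return pd.get_dummies(X).copy()

        def fit(self, X, y=None, **fitparams):
            # Nothing is learned from the training data
            return self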
fit_transform gives 2 columns for the train data and 4 columns for the test data. This is not surprising, but it is not suitable for a pipeline.
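To make the column mismatch concrete, here is a minimal sketch of one possible workaround, assuming it is acceptable to drop levels that appear only in the test data. The class name `FixedDummies` and the attribute `columns_` are hypothetical names chosen for illustration; the idea is simply to record the dummy columns in `fit` and reindex to them in `transform`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FixedDummies(TransformerMixin, BaseEstimator):
    """Hypothetical variant of CreateDummies that remembers the training levels."""

    def fit(self, X, y=None):
        # Record the dummy columns produced by the training data.
        self.columns_ = pd.get_dummies(X, dtype=int).columns
        return self

    def transform(self, X):
        # Align to the training columns: levels unseen during fit are dropped,
        # levels missing from X are added as all-zero columns.
        dummies = pd.get_dummies(X, dtype=int)
        return dummies.reindex(columns=self.columns_, fill_value=0)


train = pd.Series(list('abbaa'))
test = pd.Series(list('abcd'))

enc = FixedDummies().fit(train)
train_out = enc.transform(train)
test_out = enc.transform(test)  # same two columns as train_out
```

With this sketch, the rows for 'c' and 'd' in the test data come out as all zeros, which may or may not be the desired behaviour.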
Similarly, I tried the LabelEncoder (and imported OneHotEncoder for a potential next step):
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    le = LabelEncoder()
    le.fit_transform(train)
    le.transform(test)
which, unsurprisingly, raises an error, because the test data contains labels ('c' and 'd') that were not seen during fit.
So the problem is that I need the information contained in the test set. Is there a good way to include this in the pipeline?
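One direction that may avoid needing the test set at fit time, sketched here under assumptions rather than as a confirmed answer: `OneHotEncoder` accepts `handle_unknown='ignore'`, which encodes categories unseen during fit as all-zero rows. This assumes a scikit-learn version whose `OneHotEncoder` accepts string categories directly (0.20 or later), and that ignoring unseen levels is acceptable for the model.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.Series(list('abbaa'))
test = pd.Series(list('abcd'))

# OneHotEncoder expects 2D input, so reshape each Series to a single column.
X_train = train.to_numpy().reshape(-1, 1)
X_test = test.to_numpy().reshape(-1, 1)

# Categories not seen during fit are encoded as all-zero rows instead of raising.
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(X_train)

# The default output is a sparse matrix; convert to a dense array for inspection.
encoded_test = ohe.transform(X_test).toarray()
```

Because the encoder is fitted only on the training data, it can be used as an ordinary pipeline step, and the output width stays fixed at the number of training levels.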