Creating dummy variables in a pipeline with different levels in the train and test set

I am currently learning scikit-learn pipelines, and I want to pre-process my data with a pipeline as well. However, a categorical variable has different levels in my train and test data. For example, consider:

    import pandas as pd

    train = pd.Series(list('abbaa'))
    test = pd.Series(list('abcd'))

I wrote a TransformerMixin class using pandas:

    from sklearn.base import TransformerMixin

    class CreateDummies(TransformerMixin):
        def transform(self, X, **transformparams):
            return pd.get_dummies(X).copy()

        def fit(self, X, y=None, **fitparams):
            return self

fit_transform gives 2 columns for the train data and 4 columns for the test data. That is not surprising, but it makes the transformer unsuitable for a pipeline.
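The mismatch is easy to reproduce outside the transformer. The sketch below is only an illustration: it passes `dtype=int` so the output is 0/1 regardless of pandas version, and it shows `reindex` as one possible way to force the test frame onto the training columns. The name `test_aligned` is introduced here and is not part of the original code.

```python
import pandas as pd

train = pd.Series(list('abbaa'))
test = pd.Series(list('abcd'))

# Each call only sees the levels present in its own input.
train_d = pd.get_dummies(train, dtype=int)
test_d = pd.get_dummies(test, dtype=int)
print(train_d.shape, test_d.shape)  # (5, 2) (4, 4)

# One possible alignment: keep only the training columns.
# Levels unseen in training ('c', 'd') are dropped, leaving all-zero rows.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned.shape)  # (4, 2)
```

Note that this alignment discards the extra test levels rather than encoding them, which may or may not be what you want.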

Similarly, I tried to use LabelEncoder (and OneHotEncoder for potential next steps):

    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    le = LabelEncoder()
    le.fit_transform(train)
    le.transform(test)

which, unsurprisingly, raises an error because the test data contains labels that were not seen during fitting.
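For reference, here is a minimal sketch that reproduces the failure and contrasts it with OneHotEncoder configured with `handle_unknown='ignore'`, which encodes unseen levels as all-zero rows instead of raising. The reshaping to a 2-D array and the variable names `error` and `onehot` are choices made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train = pd.Series(list('abbaa'))
test = pd.Series(list('abcd'))

# LabelEncoder refuses labels it has not seen during fit.
le = LabelEncoder()
le.fit_transform(train)
try:
    le.transform(test)
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)  # ValueError

# OneHotEncoder expects 2-D input; unknown levels are ignored on transform.
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train.to_numpy().reshape(-1, 1))
onehot = ohe.transform(test.to_numpy().reshape(-1, 1)).toarray()
print(onehot.shape)  # (4, 2)
```

With this setting the output always has the columns learned from the training data, so the rows for 'c' and 'd' contain only zeros.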

So the problem is that I need information that is only available in the test set, namely its additional levels. Is there a good way to include this in the pipeline?

1 answer

You can use categoricals, as described in this answer:

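    import numpy as np

    # Use the union of the levels seen in either set as the shared category list.
    categories = np.union1d(train, test)
    # Older pandas accepted astype('category', categories=categories);
    # current versions need an explicit CategoricalDtype.
    dtype = pd.CategoricalDtype(categories=categories)
    train = train.astype(dtype)
    test = test.astype(dtype)

    # Output below is shown as 0/1; pandas 2.x prints True/False unless dtype=int is passed.
    pd.get_dummies(train)
    Out:
       a  b  c  d
    0  1  0  0  0
    1  0  1  0  0
    2  0  1  0  0
    3  1  0  0  0
    4  1  0  0  0

    pd.get_dummies(test)
    Out:
       a  b  c  d
    0  1  0  0  0
    1  0  1  0  0
    2  0  0  1  0
    3  0  0  0  1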
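To use the same idea inside a pipeline, the fixed category list can be stored on the transformer itself. The following is a minimal sketch, not the answer's own code: the class name `FixedDummies` and its `categories` parameter are hypothetical, and it assumes the input is a single pandas Series. If no list is given, the levels are learned from the data passed to `fit`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FixedDummies(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one-hot encode a Series with a fixed column set."""

    def __init__(self, categories=None):
        # Optional explicit list of levels; if None, levels are learned in fit().
        self.categories = categories

    def fit(self, X, y=None):
        levels = self.categories if self.categories is not None else pd.unique(X)
        self.dtype_ = pd.CategoricalDtype(categories=sorted(levels))
        return self

    def transform(self, X):
        # Values outside the fixed list become NaN and yield an all-zero row.
        return pd.get_dummies(pd.Series(X).astype(self.dtype_), dtype=int)


train = pd.Series(list('abbaa'))
test = pd.Series(list('abcd'))

# Explicit levels: both outputs share the columns a, b, c, d.
enc = FixedDummies(categories=list('abcd')).fit(train)
train_d = enc.transform(train)
test_d = enc.transform(test)

# Learned levels: only a and b are known, so 'c' and 'd' map to zero rows.
learned = FixedDummies().fit(train).transform(test)
```

Passing the levels explicitly avoids peeking at the test set during fitting, while learning them in `fit` keeps the column layout stable at the cost of ignoring unseen levels.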

Source: https://habr.com/ru/post/1257572/

