Using multiple custom classes with Pipeline sklearn (Python)

Question

I am trying to write a Pipeline tutorial for students, but I am stuck. I am not an expert, but I am trying to improve, so thank you for your patience. I am trying to run a pipeline that performs several steps to prepare a DataFrame for a classifier:

Step 1: Data Frame Description
Step 2: Fill in NaN Values
Step 3: Convert Categorical Values to Numbers

Here is my code:

class Descr_df(object):

    def transform (self, X):
        print ("Structure of the data: \n {}".format(X.head(5)))
        print ("Features names: \n {}".format(X.columns))
        print ("Target: \n {}".format(X.columns[0]))
        print ("Shape of the data: \n {}".format(X.shape))

    def fit(self, X, y=None):
        return self

class Fillna(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        for column in X.columns:
            if column in non_numerics_columns:
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
            else:
                X[column] = X[column].fillna(X[column].mean())
        return X

    def fit(self, X,y=None):
        return self

class Categorical_to_numerical(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            X[column] = X[column].fillna(X[column].value_counts().idxmax())
            le.fit(X[column])
            X[column] = le.transform(X[column]).astype(int)
        return X

    def fit(self, X, y=None):
        return self

Running steps 1 and 2 together works, and so does running steps 1 and 3, but when I run steps 1, 2 and 3 together I get this error:

pipeline = Pipeline([('df_intropesction', Descr_df()), ('fillna',Fillna()), ('Categorical_to_numerical', Categorical_to_numerical())])
pipeline.fit(X, y)
AttributeError: 'NoneType' object has no attribute 'columns'

python pandas scikit-learn machine-learning pipeline

Jeremie guez Apr 19 '17 at 14:57

Vivek Kumar · Accepted Answer · 2017-04-19T16:19:38+0000

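The error occurs because the transform method of the first step in the pipeline, Descr_df, does not return anything.

Explanation:

In a pipeline, the output of each transformer is passed as input to the next step, and so on.

So when you call pipeline.fit(X, y), this is what happens: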

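Descr_df.fit(X) → returns self
newX = Descr_df.transform(X) → should return a value for newX to pass on to the next step, but it only prints, so None is returned implicitly
Fillna.fit(newX) → returns self
Fillna.transform(newX) → accesses newX.columns, but newX is None from step 2, hence the error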
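
The chaining in the steps above can be reproduced without scikit-learn. The following is only a simplified sketch: Table, PrintOnly, UsesColumns and fit_chain are hypothetical names invented for illustration, and the real Pipeline does considerably more. It relies on one fact about Pipeline.fit: every step except the last is fitted and then used to transform the data, while the last step is only fitted. That may be why the two-step pipelines in the question ran without error.

```python
class Table:
    """Hypothetical stand-in for a DataFrame: it only carries column names."""

    def __init__(self, columns):
        self.columns = columns


class PrintOnly:
    """Like the original Descr_df: transform prints and returns nothing."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Features names:", X.columns)
        # No return statement, so Python implicitly returns None.


class UsesColumns:
    """Like Fillna: transform reads X.columns."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        _ = X.columns  # raises AttributeError when X is None
        return X


def fit_chain(steps, X, y=None):
    """Simplified approximation of the chaining done by Pipeline.fit."""
    for step in steps[:-1]:
        # Each step except the last is fitted, then its output replaces X.
        X = step.fit(X, y).transform(X)
    # The last step is only fitted; its transform is not called here.
    steps[-1].fit(X, y)
    return X  # the data the last step received, returned for inspection


data = Table(["a", "b"])

# Two steps: the None produced by PrintOnly only reaches a fit that ignores it.
seen_by_last = fit_chain([PrintOnly(), UsesColumns()], data)

# Three steps: the middle step's transform receives None and fails.
try:
    fit_chain([PrintOnly(), UsesColumns(), UsesColumns()], data)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

With two steps the None goes unnoticed; with three, the middle step's transform is the first code to touch it.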

Solution: change the transform method of Descr_df so that it returns the DataFrame unchanged:

def transform (self, X):
    print ("Structure of the data: \n {}".format(X.head(5)))
    print ("Features names: \n {}".format(X.columns))
    print ("Target: \n {}".format(X.columns[0]))
    print ("Shape of the data: \n {}".format(X.shape))
    return X

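Suggestion: make your classes inherit from the BaseEstimator and TransformerMixin classes of scikit-learn, which is considered good practice.

That is, change class Descr_df(object) to class Descr_df(BaseEstimator, TransformerMixin), Fillna(object) to Fillna(BaseEstimator, TransformerMixin), and so on.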
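
As a rough sketch of that suggestion, two of the steps could look like the code below. The class names DescribeFrame and FillMissing, the sample DataFrame, and the choice to work on a copy of X are assumptions made for illustration, not part of the original code. The mixin is listed before BaseEstimator here because the scikit-learn developer guide recommends placing mixins to the left; the order shown in the answer text is the other way round.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DescribeFrame(TransformerMixin, BaseEstimator):
    """Prints a short description and passes the data through unchanged."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Shape of the data: \n {}".format(X.shape))
        return X  # returning X is what keeps the pipeline chain intact


class FillMissing(TransformerMixin, BaseEstimator):
    """Fills NaN with the mean (numeric) or the most frequent value (other)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # assumption: avoid modifying the caller's DataFrame
        for column in X.columns:
            if pd.api.types.is_numeric_dtype(X[column]):
                X[column] = X[column].fillna(X[column].mean())
            else:
                most_frequent = X[column].value_counts().idxmax()
                X[column] = X[column].fillna(most_frequent)
        return X


# Hypothetical sample data with one missing value per column.
frame = pd.DataFrame(
    {"age": [20.0, None, 40.0], "city": ["Paris", None, "Paris"]}
)

pipeline = Pipeline([("describe", DescribeFrame()), ("fill", FillMissing())])
result = pipeline.fit_transform(frame)
print(result)
```

TransformerMixin supplies fit_transform from fit and transform, and BaseEstimator supplies get_params and set_params, so the classes also work with tools such as grid search.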

See this example for more details on custom classes in a Pipeline:

http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
