Custom transformer for sklearn Pipeline, which modifies both X and y

Question

I want to create my own transformer for use with the sklearn Pipeline. To do so, I am creating a class that implements both the fit and transform methods. The purpose of the transformer is to remove rows from the matrix that have more than a specified number of NaNs. The issue I am facing is: how can I change both the X and y matrices that are passed to the transformer? I believe this has to be done in the fit method, since it has access to both X and y. Since Python passes arguments by assignment, once I reassign X to a new matrix with fewer rows, the reference to the original X is lost (and of course the same is true for y). Is it possible to maintain this reference?

I'm using a pandas DataFrame to easily drop the rows that have too many NaNs; this may not be the right way to do it for my use case. The current code looks like this:

    class Dropna():

        # thresh is max number of NaNs allowed in a row
        def __init__(self, thresh=0):
            self.thresh = thresh

        def fit(self, X, y):
            total = X.shape[1]
            # +1 to account for 'y' being added to the dframe
            new_thresh = total + 1 - self.thresh
            df = pd.DataFrame(X)
            df['y'] = y
            df.dropna(thresh=new_thresh, inplace=True)
            X = df.drop('y', axis=1).values
            y = df['y'].values
            return self

        def transform(self, X):
            return X

+6

python numpy scikit-learn machine-learning data-analysis

Markaward Aug 28 '14 at 1:20

3 answers

This can be solved easily with the sklearn.preprocessing.FunctionTransformer class ( http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html )

You just need to put your alterations to X in a function

    def drop_nans(X, y=None):
        total = X.shape[1]
        new_thresh = total - thresh
        df = pd.DataFrame(X)
        df.dropna(thresh=new_thresh, inplace=True)
        return df.values

then you get your transformer by calling

    transformer = FunctionTransformer(drop_nans, validate=False)

which you can use in the pipeline. The threshold can be set outside the drop_nans function.
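As a minimal sketch of the approach above, the threshold can also be passed in through the `kw_args` parameter of FunctionTransformer instead of a global variable. The `thresh` keyword on `drop_nans` and the sample array are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def drop_nans(X, y=None, thresh=0):
    # thresh (hypothetical keyword): maximum number of NaNs allowed in a row.
    # DataFrame.dropna(thresh=...) expects the minimum number of non-NaN values.
    min_non_nan = X.shape[1] - thresh
    return pd.DataFrame(X).dropna(thresh=min_non_nan).values


# kw_args forwards extra keyword arguments to drop_nans on every call.
transformer = FunctionTransformer(drop_nans, validate=False, kw_args={"thresh": 1})

# Illustrative data: the rows contain 0, 1 and 2 NaNs respectively.
X = np.array([[1.0, 2.0, 3.0],
              [np.nan, 5.0, 6.0],
              [np.nan, np.nan, 9.0]])

X_clean = transformer.fit_transform(X)  # only the row with 2 NaNs is dropped
```

Note that the function only returns X, so any y passed alongside it keeps its original length.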

+3

MaxBenChrist Jan 05 '16 at 11:23

Use "deep copies" further down the pipeline, so that X and y remain protected

On each call, .fit() can first assign a deep copy of X and y to new instance attributes

    self.X_without_NaNs = X.copy()
    self.y_without_NaNs = y.copy()

and then reduce / convert them so that no row has more NaNs than self.thresh allows
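One way this idea could look in practice is sketched below. The class name `DropnaCopies` and the sample data are hypothetical; the attribute names follow the snippet above. The filtered copies live on the transformer itself, and the caller's X and y are left untouched.

```python
import numpy as np


class DropnaCopies:
    """Hypothetical sketch: stores filtered copies of X and y as attributes."""

    def __init__(self, thresh=0):
        # thresh is the maximum number of NaNs allowed in a row
        self.thresh = thresh

    def fit(self, X, y):
        # Work on copies so the arrays owned by the caller are never modified.
        X_copy = np.array(X, dtype=float, copy=True)
        y_copy = np.array(y, copy=True)
        # Boolean mask of the rows that have at most `thresh` NaNs.
        keep = np.isnan(X_copy).sum(axis=1) <= self.thresh
        self.X_without_NaNs = X_copy[keep]
        self.y_without_NaNs = y_copy[keep]
        return self

    def transform(self, X):
        return X


# Illustrative data: the middle row contains one NaN.
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
y = np.array([0, 1, 0])

dropper = DropnaCopies(thresh=0).fit(X, y)
```

After fitting, `dropper.X_without_NaNs` and `dropper.y_without_NaNs` hold the reduced data, while later pipeline steps still receive the original X and y unless they read those attributes explicitly.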

+1

user3666197 Aug 28 '14 at 2:26

eickenberg · Accepted Answer · 2014-08-28T11:47:46+0000

Modifying the sample axis, e.g. removing samples, does not (yet?) comply with the scikit-learn transformer API. So if you need to do this, you should do it outside any calls to scikit-learn, as preprocessing.

As it is now, the transformer API is used to transform the features of a given sample into something new. This can implicitly contain information from other samples, but samples are never deleted.

Another option is to try to impute the missing values. But again, if you need to remove samples, treat it as preprocessing before using scikit-learn.
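A minimal sketch of such a preprocessing step, assuming plain NumPy arrays: the helper name `drop_sparse_rows` and the sample data are hypothetical. X and y are filtered together with one boolean mask before anything is handed to scikit-learn.

```python
import numpy as np


def drop_sparse_rows(X, y, thresh=0):
    """Return the rows of X (and matching entries of y) with at most `thresh` NaNs."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = np.isnan(X).sum(axis=1) <= thresh
    return X[keep], y[keep]


# Illustrative data: the rows contain 0, 1 and 2 NaNs respectively.
X = np.array([[1.0, 2.0, 3.0],
              [np.nan, 5.0, 6.0],
              [np.nan, np.nan, 9.0]])
y = np.array([10, 20, 30])

X_clean, y_clean = drop_sparse_rows(X, y, thresh=1)

# X_clean and y_clean now have the same length and can be passed to the
# fit method of whatever pipeline is built afterwards.
```

Because the same mask is applied to both arrays, the rows of X and the entries of y stay aligned.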
