Pre-processing both continuous and categorical (integer-typed) variables with scikit-learn

The main goals are as follows:

1) Apply StandardScaler to continuous variables

2) Apply LabelEncoder and OneHotEncoder to categorical variables

Continuous variables need to be scaled, but a couple of the categorical variables are also of integer type, so applying StandardScaler to them would produce undesirable effects.

In other words, running StandardScaler over the whole DataFrame would scale the integer categorical variables as well, which is not what we want.

Since continuous and categorical variables are mixed in a single pandas DataFrame, what is the recommended workflow for this problem?

The best example to illustrate my point is the Kaggle Bike Sharing Demand dataset, where season and weather are integer categorical variables.
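To make the problem concrete, here is a minimal sketch (using a made-up two-column frame standing in for the bike-sharing data, not the actual Kaggle file) showing that scaling the whole DataFrame also transforms the integer category codes:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame: "temp" is continuous, "season" is categorical
# but encoded as the integers 1-4, as in Bike Sharing Demand.
df = pd.DataFrame({"temp": [9.8, 13.6, 17.2, 25.0],
                   "season": [1, 2, 3, 4]})

# Scaling every column standardizes "season" along with "temp"...
scaled = StandardScaler().fit_transform(df)

# ...so the season column no longer holds the category labels 1-4.
print(scaled[:, 1])
```

The second column comes back as zero-mean floats, so any downstream model would treat season as a magnitude rather than a category.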

1 answer

Check out the sklearn_pandas.DataFrameMapper meta-transformer. Use it as the first step of your pipeline to perform column-wise data transformations:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn_pandas import DataFrameMapper

# Scale the continuous columns, one-hot encode the categorical ones.
mapper = DataFrameMapper(
  [(continuous_col, StandardScaler()) for continuous_col in continuous_cols] +
  [(categorical_col, LabelBinarizer()) for categorical_col in categorical_cols]
)
pipeline = Pipeline(
  [("mapper", mapper),
   ("estimator", estimator)]
)
pipeline.fit_transform(df, df["y"])

In addition, you should use sklearn.preprocessing.LabelBinarizer instead of the pair [LabelEncoder(), OneHotEncoder()].
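A quick sketch of that substitution, using a small hand-made season-like column (the values here are illustrative, not from the dataset): LabelBinarizer produces the one-hot matrix in a single fit_transform call.

```python
from sklearn.preprocessing import LabelBinarizer

# One step instead of LabelEncoder -> OneHotEncoder:
lb = LabelBinarizer()
onehot = lb.fit_transform([1, 2, 3, 1])  # integer category codes

print(lb.classes_)  # the distinct categories found during fit
print(onehot)       # one row per sample, one column per category
```

Each row contains exactly one 1, in the column of the corresponding category, which is the same result the two-step LabelEncoder/OneHotEncoder combination would give.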


Source: https://habr.com/ru/post/1675390/

