Python pandas: best way to normalize data?

I have a large pandas DataFrame with approximately 80 columns. Each column reports daily traffic statistics for one website (the columns are websites).

Since I don’t want to work with raw traffic statistics, I would like to normalize all my columns (except the first, which is the date), either from 0 to 1 or (even better) from 0 to 100.

 Date        A      B       ...
 10/10/2010  100.0  402.0   ...
 11/10/2010  250.0  800.0   ...
 12/10/2010  800.0  2000.0  ...
 13/10/2010  400.0  1800.0  ...

That said, I wonder which normalization to apply: min-max scaling or z-score normalization (standardization)? Some of my columns have strong outliers. It would be great to have an example. Unfortunately, I cannot provide the complete data.

1 answer

First, move the Date column to the index.

 dates = df.pop('Date')
 df.index = dates

Then either use z-score normalization:

 df1 = (df - df.mean())/df.std() 

or min-max scaling:

 df2 = (df-df.min())/(df.max()-df.min()) 
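Since the question asked for an example, here is a minimal sketch that applies both formulas to the four sample rows from the question. The variable names (`z_scores`, `min_max`, `min_max_100`) are made up for illustration, and multiplying by 100 is just one simple way to get the 0 to 100 range the question mentions.

```python
import pandas as pd

# Sample rows copied from the question; the real frame has ~80 columns.
df = pd.DataFrame(
    {
        "Date": ["10/10/2010", "11/10/2010", "12/10/2010", "13/10/2010"],
        "A": [100.0, 250.0, 800.0, 400.0],
        "B": [402.0, 800.0, 2000.0, 1800.0],
    }
)

# Move the date out of the way so only numeric columns are scaled.
df = df.set_index("Date")

# z-score: each column ends up with mean 0 and (sample) std 1.
z_scores = (df - df.mean()) / df.std()

# min-max: each column is mapped to the interval [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Same thing on a 0 to 100 scale.
min_max_100 = min_max * 100

print(z_scores)
print(min_max_100)
```

All operations are column-wise, because `df.mean()`, `df.std()`, `df.min()` and `df.max()` return one value per column and pandas aligns them with the column labels.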

I would recommend z-score normalization, because min-max scaling is highly sensitive to outliers.
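To see why outliers hurt min-max scaling, here is a small sketch with invented numbers (a hypothetical column `site` with five ordinary days and one extreme spike). It only illustrates the effect; it is not data from the question.

```python
import pandas as pd

# Hypothetical traffic column: five ordinary days plus one extreme spike.
s = pd.Series([100.0, 120.0, 110.0, 130.0, 120.0, 10000.0], name="site")

# Min-max scaling: the spike alone defines the upper end of the range.
min_max = (s - s.min()) / (s.max() - s.min())

# Scaled values of the ordinary days (everything except the last entry).
typical_min_max = min_max.iloc[:-1]

print(min_max)
print(typical_min_max.max())
```

The spike maps to 1 and every ordinary day lands in a tiny interval just above 0, so the differences between normal days almost disappear. Keep in mind that the z-score is not immune either, since the mean and standard deviation are also pulled by the spike; it just does not pin the whole scale to the single most extreme value.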


Source: https://habr.com/ru/post/1011601/

