Remove outliers in Pandas DataFrame with Percentiles

I have a DataFrame df with 40 columns and many records.

df:

User_id | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | ... | Col39

For each column except User_id, I want to check for outliers and delete the entire record if an outlier appears.

To identify outliers in each column, I decided to simply use the 5th and 95th percentiles (I know this is not the best statistical approach):

The code I have so far:

    P = np.percentile(df.Col1, [5, 95])
    new_df = df[(df.Col1 > P[0]) & (df.Col1 < P[1])]

Question: How can I apply this approach to all columns (except User_id) without doing it manually? My goal is to get a DataFrame without the records that contain outliers.

Thanks!

+8
3 answers

The initial data set.

    print(df.head())

       Col0  Col1  Col2  Col3  Col4  User_id
    0    49    31    93    53    39       44
    1    69    13    84    58    24       47
    2    41    71     2    43    58       64
    3    35    56    69    55    36       67
    4    64    24    12    18    99       67

First, remove the User_id column.

    filt_df = df.loc[:, df.columns != 'User_id']

Then calculate the percentiles.

    low = 0.05
    high = 0.95
    quant_df = filt_df.quantile([low, high])
    print(quant_df)

           Col0   Col1  Col2   Col3   Col4
    0.05   2.00   3.00   6.9   3.95   4.00
    0.95  95.05  89.05  93.0  94.00  97.05

Next, filter the values based on the computed percentiles. To do that, I use apply, column by column, and that's it!

    filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                        (x < quant_df.loc[high, x.name])], axis=0)

Bring User_id back.

    filt_df = pd.concat([df.loc[:, 'User_id'], filt_df], axis=1)

Finally, rows with NaN values can simply be dropped.

    filt_df.dropna(inplace=True)
    print(filt_df.head())

       User_id  Col0  Col1  Col2  Col3  Col4
    1       47    69    13    84    58    24
    3       67    35    56    69    55    36
    5        9    95    79    44    45    69
    6       83    69    41    66    87     6
    9       87    50    54    39    53    40

Checking the intermediate result (after filtering, before the NaN rows are dropped).

    print(filt_df.head())

       User_id  Col0  Col1  Col2  Col3  Col4
    0       44    49    31   NaN    53    39
    1       47    69    13    84    58    24
    2       64    41    71   NaN    43    58
    3       67    35    56    69    55    36
    4       67    64    24    12    18   NaN

    print(filt_df.describe())

              User_id       Col0       Col1       Col2       Col3       Col4
    count  100.000000  89.000000  88.000000  88.000000  89.000000  89.000000
    mean    48.230000  49.573034  45.659091  52.727273  47.460674  57.157303
    std     28.372292  25.672274  23.537149  26.509477  25.823728  26.231876
    min      0.000000   3.000000   5.000000   7.000000   4.000000   5.000000
    25%     23.000000  29.000000  29.000000  29.500000  24.000000  36.000000
    50%     47.000000  50.000000  40.500000  52.500000  49.000000  59.000000
    75%     74.250000  69.000000  67.000000  75.000000  70.000000  79.000000
    max     99.000000  95.000000  89.000000  92.000000  91.000000  97.000000
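As a side note, the apply / concat / dropna sequence can be collapsed into a single boolean row mask. A minimal sketch, reusing quant_df, low and high from above (no_id is just a hypothetical name for the frame without User_id):

    # Comparing a DataFrame to a Series broadcasts per column name,
    # so each column is checked against its own (low, high) band.
    no_id = df.loc[:, df.columns != 'User_id']
    inside = (no_id > quant_df.loc[low]) & (no_id < quant_df.loc[high])
    new_df = df[inside.all(axis=1)]  # keep rows that pass every column's check

This keeps User_id automatically, since the final mask is applied to the full df.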

How to generate a test dataset

    import numpy as np
    import pandas as pd

    np.random.seed(0)
    nb_sample = 100
    num_sample = (0, 100)

    d = dict()
    d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
    for i in range(5):
        d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)

    df = pd.DataFrame.from_dict(d)
+18

What you are describing is similar to winsorizing, which caps values (for example, at the 5th and 95th percentiles) instead of deleting the records entirely.

Here is an example:

    import pandas as pd
    from scipy.stats import mstats
    %matplotlib inline

    test_data = pd.Series(range(30))
    test_data.plot()

Original data

    # Truncate values to the 5th and 95th percentiles
    transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05]))
    transformed_test_data.plot()

Winsorized data
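To connect this back to the question, the same capping can be applied to every column except User_id. A sketch, not part of the original answer, assuming the 5%/95% limits from the question and the df built in the first answer:

    import numpy as np
    from scipy.stats import mstats

    capped = df.copy()
    for col in capped.columns.drop('User_id'):
        # winsorize returns a masked array; np.asarray converts it back to plain values
        capped[col] = np.asarray(mstats.winsorize(capped[col].to_numpy(),
                                                  limits=[0.05, 0.05]))

Unlike the percentile filtering above, this keeps all records and only pulls extreme values in to the band edges.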

+4

Use an inner join. Something like this should work:

    cols = df.columns.tolist()
    cols.remove('User_id')  # remove User_id from the list of columns

    # Filter on the first column
    P = np.percentile(df[cols[0]], [5, 95])
    new_df = df[(df[cols[0]] > P[0]) & (df[cols[0]] < P[1])]

    # Inner-join on the index with each remaining column's surviving rows;
    # selecting no columns (the empty list) avoids duplicate-column clashes
    for col in cols[1:]:
        P = np.percentile(df[col], [5, 95])
        new_df = new_df.join(df.loc[(df[col] > P[0]) & (df[col] < P[1]), []],
                             how='inner')
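Alternatively, the repeated joins can be avoided by combining one boolean mask per column. A sketch under the same 5th/95th percentile assumptions, reusing cols from above:

    import numpy as np

    # One mask per column; a row survives only if every mask is True
    masks = [(df[col] > np.percentile(df[col], 5)) &
             (df[col] < np.percentile(df[col], 95)) for col in cols]
    new_df = df[np.logical_and.reduce(masks)]

Both versions intersect the per-column filters, so a record is dropped as soon as any one of its values is an outlier.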
+1

Source: https://habr.com/ru/post/1276122/

