I followed the approach from the question linked below to remove outliers, but something in my result is logically wrong.
Remove outliers in Pandas DataFrame with Percentiles
I have a dataset with the first column as "id" and the last column as "label".
Here is my code snippet. I drop the "id" and "label" columns, filter the remaining feature columns, and then add the two columns back:
    def processing_data(train_data, test_data):
        # computing percentiles.
        low = .05
        high = .95
        filt_df = train_data.loc[:, train_data.columns != 'id']
        filt_df = filt_df.loc[:, filt_df.columns != 'label']
        quant_df = filt_df.quantile([low, high])
        print(quant_df)

        # filtering values based on computed percentiles. To do that use an apply by columns.
        print("Before removing outlier", filt_df, filt_df.shape)
        train_data1 = filt_df.apply(lambda x: x[(x >= quant_df.loc[low, x.name]) &
                                                (x <= quant_df.loc[high, x.name])], axis=0)
        print("After removing outlier,", train_data1, train_data1.shape)
        print(train_data1.isnull().sum())

        train_data1 = pd.concat([train_data.loc[:, 'id'], train_data1], axis=1)
        train_data = pd.concat([train_data.loc[:, 'label'], train_data1], axis=1)

        #train_data.dropna(inplace=True)
        #train_data.fillna(0)
        #test_data.fillna(0)
        #print(train_data)
        #print(np.isnan(train_data).any().sum())
        return train_data, test_data
Result: every row contains at least one NaN value, so when I call train_data.dropna(inplace=True) all rows are discarded. Weird!
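To show the symptom without my real data, here is a minimal sketch on synthetic data. The random normal features, the 1000-row size, and the seed are assumptions made only for illustration; the filter itself is the same per-column apply as in processing_data. If I understand it correctly, each column masks roughly 10% of its values, and with 250 independent columns the chance that a row is untouched in every column is about 0.9^250, which is practically zero.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption): 1000 rows, 250 independent features.
rng = np.random.default_rng(0)
features = pd.DataFrame(
    rng.normal(size=(1000, 250)),
    columns=[f"feature{i}" for i in range(250)],
)

low, high = 0.05, 0.95
quant_df = features.quantile([low, high])

# Same per-column filter as in processing_data: each column keeps only the values
# inside its own [5%, 95%] range, and the dropped positions come back as NaN
# when the columns are realigned into one DataFrame.
filtered = features.apply(
    lambda x: x[(x >= quant_df.loc[low, x.name]) & (x <= quant_df.loc[high, x.name])],
    axis=0,
)

nan_per_column = filtered.isnull().sum()
rows_with_nan = int(filtered.isnull().any(axis=1).sum())
rows_left = len(filtered.dropna())

print(nan_per_column.head())
print("rows with at least one NaN:", rows_with_nan)
print("rows left after dropna:", rows_left)
```

In this sketch the NaN values are already present before any concatenation with "id" or "label" happens.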
How can I fix this? I suspect that something goes wrong when I concatenate the "id" and "label" columns back onto the processed features. Is that where the problem is?
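One alternative I am considering, shown here only as a sketch, is to clip each feature to its own percentile bounds instead of masking values. Note that this is a different technique (capping, sometimes called winsorizing) rather than removal, so no NaN is introduced and no row is lost. The synthetic frame, its size, and the name feature_cols are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small synthetic frame with the same layout as my data (assumption):
# "id" first, feature columns in the middle, "label" last.
rng = np.random.default_rng(0)
train_data = pd.DataFrame(
    rng.normal(size=(200, 5)),
    columns=[f"feature{i}" for i in range(5)],
)
train_data.insert(0, "id", range(200))
train_data["label"] = rng.integers(0, 3, size=200)

low, high = 0.05, 0.95

# feature_cols is a hypothetical helper name: every column except "id" and "label".
feature_cols = train_data.columns.difference(["id", "label"])
quant_df = train_data[feature_cols].quantile([low, high])

# Cap each feature at its own 5th/95th percentile; axis=1 aligns the
# bounds (indexed by column name) with the DataFrame columns.
clipped = train_data.copy()
clipped[feature_cols] = train_data[feature_cols].clip(
    lower=quant_df.loc[low], upper=quant_df.loc[high], axis=1
)

print(clipped.isnull().sum().sum())
print(clipped.shape)
```

Because "id" and "label" are never split off, nothing has to be concatenated back afterwards. Whether capping is acceptable instead of dropping depends on what the model downstream expects.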
Here is the dataset:
    id feature0 feature1 feature2 feature3 feature4 feature249 label
    0 25.20824887 -16.7457484 50.86994402 5.593471686 1.188262678 1
    1 -86.93144987 0.428227194 2.87483597 -8.064850183 6.056867093 2
    2 42.16093367 7.85701304 151.6127571 9.639675583 5.570138511 0
    3 20.66694385 8.680641918 -56.44917913 -9.814779803 -2.382979151 1
    4 35.9466789 4.57373573 -28.16021186 -6.91297056 4.879375409 0