Imputing data with fancyimpute and pandas

I have a large pandas data frame df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means, or the most frequent values is not an option either (so imputation with pandas and/or scikit-learn unfortunately does not do the trick).

I came across what seems to be a neat package called fancyimpute. But I have some problems with it.

Here is what I do:

    # the necessary imports
    import pandas as pd
    import numpy as np
    from fancyimpute import KNN

    # df is my data frame with the missings. I keep only floats
    df_numeric = df.select_dtypes(include=[float])

    # I now run fancyimpute KNN,
    # it returns a np.array which I store as a pandas dataframe
    df_filled = pd.DataFrame(KNN(3).complete(df_numeric))

However, df_filled is somehow a single vector instead of a filled data frame. How do I get an imputed data frame?

Update

I realized that fancyimpute needs a numpy array. So I converted df_numeric to an array using as_matrix().

    # df is my data frame with the missings. I keep only floats
    df_numeric = df.select_dtypes(include=[float]).as_matrix()

    # I now run fancyimpute KNN,
    # it returns a np.array which I store as a pandas dataframe
    df_filled = pd.DataFrame(KNN(3).complete(df_numeric))

The output is a data frame whose column labels have gone missing. Is there any way to get the labels back?

+13
5 answers
    df = pd.DataFrame(data=mice.complete(d), columns=d.columns, index=d.index)

The np.array returned by the .complete() method of the fancyimpute object (be it MICE or KNN) is passed as the content (argument data=) of a pandas data frame whose columns and index are the same as those of the original data frame.
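
A minimal sketch of that pattern, written so that it runs without fancyimpute installed. The function `fake_impute` is a hypothetical stand-in for a call such as `KNN(3).complete(...)`: it only fills NaNs with column means so that there is an array to wrap. The point here is the relabeling step, not the imputation itself.

```python
import numpy as np
import pandas as pd

def fake_impute(values):
    """Hypothetical stand-in for an imputer: takes and returns a NumPy array."""
    filled = values.copy()
    col_means = np.nanmean(filled, axis=0)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled

d = pd.DataFrame(
    {"height": [1.0, np.nan, 3.0], "weight": [10.0, 20.0, np.nan]},
    index=["a", "b", "c"],
)

# The imputer only sees a plain ndarray, so row and column labels are gone.
arr = fake_impute(d.to_numpy())

# Without labels, pandas falls back to integer columns and an integer index.
unlabeled = pd.DataFrame(arr)

# Passing the original labels restores them.
labeled = pd.DataFrame(data=arr, columns=d.columns, index=d.index)
```

`unlabeled` and `labeled` hold the same numbers; only the second can be lined up with the original frame by label.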

+2

Add the following lines after your code:

    df_filled.columns = df_numeric.columns
    df_filled.index = df_numeric.index
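
Restoring the index matters as soon as the imputed columns have to go back next to the columns that `select_dtypes` dropped. A small sketch, assuming a frame with one string column and a non-default index; the filled array is written by hand here rather than produced by fancyimpute, and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["x", "y", "z"], "score": [1.5, np.nan, 3.5]},
    index=[10, 20, 30],
)
df_numeric = df.select_dtypes(include=[float])

# Hand-written stand-in for the array an imputer would return.
filled_values = np.array([[1.5], [2.5], [3.5]])

df_filled = pd.DataFrame(filled_values)
df_filled.columns = df_numeric.columns
df_filled.index = df_numeric.index

# Column assignment aligns on index labels, so the rows line up again.
df_complete = df.copy()
for col in df_filled.columns:
    df_complete[col] = df_filled[col]
```

Without the `df_filled.index = ...` line, `df_filled` would carry the labels 0, 1, 2, none of which match 10, 20, 30, and the assignment would write NaN into every row.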
+6

I see the frustration with fancyimpute and pandas. Here is a fairly basic wrapper that overrides the parent's methods and calls back into them. It takes in and returns a data frame, with the column names intact. Wrappers like this work well with pipelines.

    import pandas as pd
    from fancyimpute import SoftImpute

    class SoftImputeDf(SoftImpute):
        """DataFrame Wrapper around SoftImpute"""

        def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                     max_iters=100, max_rank=None, n_power_iterations=1,
                     init_fill_method="zero", min_value=None, max_value=None,
                     normalizer=None, verbose=True):
            # Note: verbose=False is always passed on, whatever the argument says
            super(SoftImputeDf, self).__init__(
                shrinkage_value=shrinkage_value,
                convergence_threshold=convergence_threshold,
                max_iters=max_iters, max_rank=max_rank,
                n_power_iterations=n_power_iterations,
                init_fill_method=init_fill_method,
                min_value=min_value, max_value=max_value,
                normalizer=normalizer, verbose=False)

        def fit_transform(self, X, y=None):
            assert isinstance(X, pd.DataFrame), "Must be pandas dframe"
            # Columns with fewer than 10 missing values are zero-filled first
            # (this modifies the caller's X)
            for col in X.columns:
                if X[col].isnull().sum() < 10:
                    X[col] = X[col].fillna(0.0)
            z = super(SoftImputeDf, self).fit_transform(X.values)
            return pd.DataFrame(z, index=X.index, columns=X.columns)
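
The same idea can be written once for any imputer instead of subclassing each one. The sketch below is an illustration under stated assumptions, not part of fancyimpute: `DataFrameImputer` is a hypothetical wrapper that accepts any object with a `fit_transform(array)` method, and `ZeroFill` is a dummy imputer that exists only so the example runs without fancyimpute installed.

```python
import numpy as np
import pandas as pd

class ZeroFill:
    """Dummy imputer (hypothetical): replaces NaN with 0.0."""

    def fit_transform(self, X, y=None):
        return np.where(np.isnan(X), 0.0, X)

class DataFrameImputer:
    """Hypothetical wrapper: DataFrame in, DataFrame out, labels intact."""

    def __init__(self, imputer):
        self.imputer = imputer

    def fit_transform(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a pandas DataFrame")
        # Hand the wrapped imputer a plain float array, then relabel its output.
        filled = self.imputer.fit_transform(X.to_numpy(dtype=float))
        return pd.DataFrame(filled, index=X.index, columns=X.columns)

frame = pd.DataFrame(
    {"a": [1.0, np.nan], "b": [np.nan, 4.0]},
    index=["r1", "r2"],
)
result = DataFrameImputer(ZeroFill()).fit_transform(frame)
```

Composition keeps the wrapper independent of any one imputer's constructor arguments. Since the SoftImpute class above exposes `fit_transform`, it could presumably be passed in place of `ZeroFill`, though that is not tested here.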
+4

Can I get code for imputing missing data using KNN in Python?

0

I really appreciate @jander081's approach and expanded on it a bit to deal with categorical columns. I had a problem where the categorical columns would lose their dtype and cause errors during training, so I modified the code as follows:

    from fancyimpute import SoftImpute
    import pandas as pd

    class SoftImputeDf(SoftImpute):
        """DataFrame Wrapper around SoftImpute"""

        def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                     max_iters=100, max_rank=None, n_power_iterations=1,
                     init_fill_method="zero", min_value=None, max_value=None,
                     normalizer=None, verbose=True):
            # Note: verbose=False is always passed on, whatever the argument says
            super(SoftImputeDf, self).__init__(
                shrinkage_value=shrinkage_value,
                convergence_threshold=convergence_threshold,
                max_iters=max_iters, max_rank=max_rank,
                n_power_iterations=n_power_iterations,
                init_fill_method=init_fill_method,
                min_value=min_value, max_value=max_value,
                normalizer=normalizer, verbose=False)

        def fit_transform(self, X, y=None):
            assert isinstance(X, pd.DataFrame), "Must be pandas dframe"
            # Columns with fewer than 10 missing values are zero-filled first
            # (this modifies the caller's X)
            for col in X.columns:
                if X[col].isnull().sum() < 10:
                    X[col] = X[col].fillna(0.0)
            z = super(SoftImputeDf, self).fit_transform(X.values)
            df = pd.DataFrame(z, index=X.index, columns=X.columns)
            # Restore the category dtype on the columns that had it in X
            cats = list(X.select_dtypes(include='category'))
            df[cats] = df[cats].astype('category')
            return df
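
Why the dtype has to be restored: a NumPy array has a single dtype, so the category information does not survive the round trip through the array. A minimal sketch with a numerically coded categorical column; the column names are made up for illustration, no imputer is involved, and the dtype is restored column by column, which has the same effect as the assignment above.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "grade": pd.Categorical([1.0, 2.0, 1.0]),
    "value": [0.5, 1.5, 2.5],
})

# Build one float array column by column; this is what an imputer works on.
z = np.column_stack([X[col].astype(float).to_numpy() for col in X.columns])

df = pd.DataFrame(z, index=X.index, columns=X.columns)
dtype_after_round_trip = str(df["grade"].dtype)  # the category dtype is lost here

# Look up which columns were categorical in the input and convert them back.
cats = list(X.select_dtypes(include="category"))
for col in cats:
    df[col] = df[col].astype("category")
```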
0

Source: https://habr.com/ru/post/1270063/

