Imputer on some Dataframe columns in Python

Question

Imputer on some Dataframe columns in Python

I am learning how to use Imputer in Python.

This is my code:

df=pd.DataFrame([["XXL", 8, "black", "class 1", 22],
["L", np.nan, "gray", "class 2", 20],
["XL", 10, "blue", "class 2", 19],
["M", np.nan, "orange", "class 1", 17],
["M", 11, "green", "class 3", np.nan],
["M", 7, "red", "class 1", 22]])

df.columns=["size", "price", "color", "class", "boh"]

from sklearn.preprocessing import Imputer

imp=Imputer(missing_values="NaN", strategy="mean" )
imp.fit(df["price"])

df["price"]=imp.transform(df["price"])

However, this results in the following error: ValueError: the length of the values does not match the length of the index

What is wrong with my code ???

thanks for the help

+8

python scikit-learn missing-data imputation

Mauro gentile Jul 26 '16 at 7:59

source share

4 answers

frist · Answer 1 · 2016-07-26T10:49:45+0000

This is because it is Imputercommonly used with DataFrames, not a series. Possible Solution:

imp=Imputer(missing_values="NaN", strategy="mean" )
imp.fit(df[["price"]])
df["price"]=imp.transform(df[["price"]]).ravel()

# Or even 
imp=Imputer(missing_values="NaN", strategy="mean" )
df["price"]=imp.fit_transform(df[["price"]]).ravel()

Ryan · Answer 2 · 2016-07-26T08:20:18+0000

I think you want to specify the axis for the importer, and then transpose the returned array:

import pandas as pd
import numpy as np

df=pd.DataFrame([["XXL", 8, "black", "class 1", 22],
["L", np.nan, "gray", "class 2", 20],
["XL", 10, "blue", "class 2", 19],
["M", np.nan, "orange", "class 1", 17],
["M", 11, "green", "class 3", np.nan],
["M", 7, "red", "class 1", 22]])

df.columns=["size", "price", "color", "class", "boh"]

from sklearn.preprocessing import Imputer

imp=Imputer(missing_values="NaN", strategy="mean",axis=1 ) #specify axis
q = imp.fit_transform(df["price"]).T #perform a transpose operation


df["price"]=q
print df

Sachin Prabhu · Answer 3 · 2018-08-19T06:23:19+0000

-

df=pd.DataFrame([["XXL", 8, "black", "class 1", 22],
["L", np.nan, "gray", "class 2", 20],
["XL", 10, "blue", "class 2", 19],
["M", np.nan, "orange", "class 1", 17],
["M", 11, "green", "class 3", np.nan],
["M", 7, "red", "class 1", 22]])

df.columns=["size", "price", "color", "class", "boh"]

from sklearn.preprocessing import Imputer

imp=Imputer(missing_values="NaN", strategy="mean" )
imp.fit(df[["price"]])

df["price"]=imp.transform(df[["price"]])

df['boh'] = imp.fit_transform(df[['price']])

DataFrame

shinchaan · Answer 4 · 2019-05-31T12:23:29+0000

Below is the documentation for Simple Imputer . For the fit method, it uses an array-like or sparse metric as an input parameter. You can try this:

imp.fit(df.iloc[:,1:2]) 
df['price']=imp.transform(df.iloc[:,1:2])

specify the location of the index to match the method, and then apply the transformation.

>>> df
   size  price   color    class   boh
 0  XXL    8.0   black  class 1  22.0
 1    L    9.0    gray  class 2  20.0
 2   XL   10.0    blue  class 2  19.0
 3    M    9.0  orange  class 1  17.0
 4    M   11.0   green  class 3   NaN
 5    M    7.0     red  class 1  22.0

Just like you can do for boh

imp.fit(df.iloc[:,4:5])
df['price']=imp.transform(df.iloc[:,4:5])
>>> df
    size  price   color    class   boh
 0  XXL    8.0   black  class 1  22.0
 1    L    9.0    gray  class 2  20.0
 2   XL   10.0    blue  class 2  19.0
 3    M    9.0  orange  class 1  17.0
 4    M   11.0   green  class 3  20.0
 5    M    7.0     red  class 1  22.0

Please correct me if I am wrong. Suggestions will be appreciated.

Imputer on some Dataframe columns in Python

More articles: