Pandas: How to replace ? with NaN - handling non-standard missing values

Question

I am new to pandas and I am trying to load a CSV file into a DataFrame. My data has missing values represented as ?, and I am trying to replace them with the standard missing value, NaN.

Please help me with this. I tried reading through the pandas docs, but I could not follow them.

 import pandas as pd

 def readData(filename):
     DataLabels = ["age", "workclass", "fnlwgt", "education", "education-num",
                   "marital-status", "occupation", "relationship", "race", "sex",
                   "capital-gain", "capital-loss", "hours-per-week",
                   "native-country", "class"]

     # ==== trying to replace ? with NaN using na_values
     rawfile = pd.read_csv(filename, header=None, names=DataLabels, na_values=["?"])
     age = rawfile["age"]
     print(age)
     print(rawfile[25:40])

     # ======== trying to replace ?
     rawfile.replace("?", "NaN")
     print(rawfile[25:40])

[Image: snapshot of the data]

+10

python pandas

swati saoji Mar 25 '15 at 4:52


4 answers

Use numpy.nan

Numpy - Replace Number With NaN

 import numpy as np
 df.applymap(lambda x: np.nan if x == '?' else x)
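A small sketch of the element-wise approach on a made-up frame (the column names and values are invented for illustration). pandas 2.1 and later rename DataFrame.applymap to DataFrame.map, so the sketch picks whichever is available. The call returns a new DataFrame rather than changing df in place.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the question's file.
df = pd.DataFrame({"workclass": ["Private", "?"], "age": [39, 54]})

# DataFrame.map supersedes applymap in pandas >= 2.1; fall back on older versions.
elementwise = df.map if hasattr(df, "map") else df.applymap

# Returns a new DataFrame; df itself is left unchanged.
cleaned = elementwise(lambda x: np.nan if x == "?" else x)
print(cleaned)
```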

+2

Liam Foley Mar 25 '15 at 5:07


OK, I got it working with:

 # ======== trying to replace ?
 newraw = rawfile.replace('[?]', np.nan, regex=True)
 print(newraw[25:40])
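As far as I can tell, this works despite the leading space because, with regex=True and a non-string replacement such as np.nan, pandas swaps out the whole cell whenever the pattern is found anywhere in it. The flip side is that a legitimate value containing a question mark would be wiped as well. A sketch comparing the unanchored pattern with an anchored one, on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical values; note the leading space in " ?", as in the raw file.
raw = pd.DataFrame({"occupation": [" Sales", " ?", " Why?"]})

# Unanchored: any cell containing "?" becomes NaN, including " Why?".
loose = raw.replace(r"[?]", np.nan, regex=True)

# Anchored: only cells that are just "?" plus optional whitespace become NaN.
strict = raw.replace(r"^\s*\?\s*$", np.nan, regex=True)

print(loose)
print(strict)
```

The anchored pattern is the safer choice when real values may contain a question mark.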

+2

swati saoji Mar 25 '15 at 5:11


Sometimes the ? is surrounded by spaces in files generated by systems such as Informatica or HANA.

First, strip the whitespace from the string columns of the DataFrame:

 temp_df_trimmed = temp_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

Then apply a function to replace the values:

 temp_df_trimmed['RC'] = temp_df_trimmed['RC'].map(lambda x: np.nan if x=="?" else x)
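The two steps can also be applied to every text column at once. A sketch on invented data (temp_df and RC follow the names above; the values and the qty column are made up). It uses pandas.api.types.is_string_dtype instead of the dtype == "object" check, on the assumption that this also covers pandas' dedicated string dtypes.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical frame with padded "?" markers.
temp_df = pd.DataFrame({"RC": [" A1 ", " ? ", "B2"], "qty": [1, 2, 3]})

# Strip surrounding whitespace from string columns only; leave numbers alone.
temp_df_trimmed = temp_df.apply(
    lambda col: col.str.strip() if is_string_dtype(col) else col
)

# An exact-match replacement now works because " ? " has become "?".
temp_df_trimmed = temp_df_trimmed.replace("?", np.nan)
print(temp_df_trimmed)
```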

0

Nishanth Sep 05 '19 at 9:01


Edchum · Accepted Answer · 2015-03-25T08:50:40+0000

You can replace it for a single column using replace:

 df['workclass'].replace('?', np.nan)

or for the whole df:

 df.replace('?', np.nan)
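Note that replace returns a new object and leaves the original untouched, so the result has to be assigned back. Replacing with the string "NaN", as in the question, also just stores text rather than a real missing value. A short sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with clean "?" markers (no surrounding whitespace).
df = pd.DataFrame({"workclass": ["Private", "?"], "occupation": ["?", "Sales"]})

df.replace("?", np.nan)              # result discarded: df is unchanged
unchanged = int(df.isna().sum().sum())

df = df.replace("?", np.nan)         # assign the result back
print(df.isna().sum())               # per-column count of missing values
```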

UPDATE

OK, I understand your problem now. By default, if you do not pass a separator, read_csv uses the comma ',' as the separator.

Look at your data, in particular this example of a problematic line:

 54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K

The separator is actually a comma followed by a space, so when you passed na_values=["?"] it did not match: every value has a leading whitespace character that is easy to overlook.

If you change your line as follows (a raw string for the regex separator, and engine='python' because regex separators are not supported by the default C parser):

 rawfile = pd.read_csv(filename, header=None, names=DataLabels, sep=r',\s', na_values=["?"], engine='python')

then you should find that it all works:

 27 54 NaN 180211 Some-college 10
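An alternative that avoids a regex separator is read_csv's skipinitialspace=True option, which drops the spaces after each comma so that na_values=["?"] matches. A self-contained sketch using an in-memory stand-in for the file: the first row is the problematic line quoted above, the second row is invented for illustration.

```python
import io
import pandas as pd

DataLabels = ["age", "workclass", "fnlwgt", "education", "education-num",
              "marital-status", "occupation", "relationship", "race", "sex",
              "capital-gain", "capital-loss", "hours-per-week",
              "native-country", "class"]

# In-memory stand-in for the CSV file (second row is hypothetical).
sample = io.StringIO(
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, "
    "Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
    "39, Private, 77516, Bachelors, 13, Never-married, Sales, Not-in-family, "
    "White, Male, 0, 0, 40, United-States, <=50K\n"
)

# skipinitialspace=True strips the space after each comma before matching na_values.
rawfile = pd.read_csv(sample, header=None, names=DataLabels,
                      skipinitialspace=True, na_values=["?"])
print(rawfile[["age", "workclass", "occupation"]])
```

This keeps the default C parser and leaves the other string values without a leading space.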
