Pandas: how to replace ? with NaN - handling non-standard missing values

I am new to pandas and I am trying to load a CSV file into a DataFrame. My data has missing values represented as ?, and I am trying to replace them with the standard missing value, NaN.

Please help me with this. I tried reading through the pandas docs, but I could not follow them.

    import pandas as pd

    def readData(filename):
        DataLabels = ["age", "workclass", "fnlwgt", "education", "education-num",
                      "marital-status", "occupation", "relationship", "race", "sex",
                      "capital-gain", "capital-loss", "hours-per-week",
                      "native-country", "class"]
        # ==== trying to replace ? with NaN using na_values
        rawfile = pd.read_csv(filename, header=None, names=DataLabels, na_values=["?"])
        age = rawfile["age"]
        print(age)
        print(rawfile[25:40])
        # ======== trying to replace ?
        rawfile.replace("?", "NaN")
        print(rawfile[25:40])


4 answers

You can replace it in a single column using replace:

 df['workclass'].replace('?', np.nan) 

or across the whole df:

 df.replace('?', np.nan) 
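Note that replace returns a new object rather than modifying the frame in place, so the result has to be assigned back. A minimal sketch, using a small made-up frame (the values are hypothetical, not taken from the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real data set.
df = pd.DataFrame({"age": [54, 38], "workclass": ["?", "Private"]})

# Calling replace without using its result leaves df unchanged.
df.replace("?", np.nan)
unchanged = df["workclass"].tolist()

# Assign the result back to keep the change.
df = df.replace("?", np.nan)
missing_mask = df["workclass"].isna().tolist()

print(unchanged)
print(missing_mask)
```

This is also why the bare rawfile.replace("?", "NaN") call in the question has no visible effect; in addition, "NaN" in quotes is an ordinary string, not a missing value.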

UPDATE

OK, I understand your problem now. By default, if you do not pass a separator, read_csv uses the comma ',' as the separator.

Look at your data, and in particular at this example of a problematic line:

 54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K 

The separator is actually a comma followed by a space, so when you passed na_values=["?"] it did not match, because every value has a leading whitespace character that you cannot see.

If you change your line as follows (a regex separator requires the Python parser engine):

 rawfile = pd.read_csv(filename, header=None, names=DataLabels, sep=r',\s', engine='python', na_values=["?"]) 

then you should find that it all works:

 27 54 NaN 180211 Some-college 10 
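An alternative to the regex separator, not mentioned in the answer above, is read_csv's skipinitialspace option. The sketch below compares the two behaviours on two made-up rows in the same "comma plus space" layout; the rows and the shortened column list are assumptions for illustration only.

```python
import io
import pandas as pd

# Two made-up rows in the same "comma + space" layout as the data above.
raw = "54, ?, 180211, Some-college\n38, Private, 215646, HS-grad\n"
cols = ["age", "workclass", "fnlwgt", "education"]

# Default parsing keeps the leading space, so the cell is " ?" and
# does not match the na_values entry "?".
plain = pd.read_csv(io.StringIO(raw), header=None, names=cols, na_values=["?"])

# skipinitialspace=True drops the space after each delimiter, so the
# cell becomes "?" and is recognised as missing.
fixed = pd.read_csv(io.StringIO(raw), header=None, names=cols,
                    na_values=["?"], skipinitialspace=True)

plain_missing = plain["workclass"].isna().tolist()
fixed_missing = fixed["workclass"].isna().tolist()

print(plain_missing)
print(fixed_missing)
```

Because skipinitialspace works with the default C parser, it avoids the slower Python engine that a regex separator needs.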

Use numpy.nan

Numpy - Replace Number With NaN

    import numpy as np
    df.applymap(lambda x: np.nan if x == '?' else x)

OK, I got it working with:

    # ======== trying to replace ?
    newraw = rawfile.replace('[?]', np.nan, regex=True)
    print(newraw[25:40])
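The pattern '[?]' matches a question mark anywhere in a cell. A sketch of a stricter variant, anchored so that only cells consisting of a ? with optional surrounding whitespace are treated as missing, might look like this (the sample values are made up):

```python
import numpy as np
import pandas as pd

# Made-up values; object dtype keeps the cells as plain Python strings.
df = pd.DataFrame({"workclass": [" ?", "?", "Private", "Why?"]}, dtype=object)

# ^\s*\?\s*$ matches a cell that is only a question mark, optionally
# padded with whitespace on either side.
cleaned = df.replace(r"^\s*\?\s*$", np.nan, regex=True)

mask = cleaned["workclass"].isna().tolist()
print(mask)
```

The anchors leave legitimate text that merely contains a question mark untouched.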

Sometimes the ? is surrounded by whitespace in files generated by systems such as Informatica or HANA.

First you need to strip the whitespace from the text columns of the DataFrame:

 temp_df_trimmed = temp_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x) 

Then apply a function to replace the values:

 temp_df_trimmed['RC'] = temp_df_trimmed['RC'].map(lambda x: np.nan if x=="?" else x) 
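Put together, the two steps can be tried on a small frame. This is only a sketch: the column name RC comes from the snippet above, while the qty column and all values are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented sample: a padded text column and a numeric column.
temp_df = pd.DataFrame({
    "RC": pd.Series([" ?", "A ", "?"], dtype=object),
    "qty": [1, 2, 3],
})

# Step 1: strip whitespace, but only in object (text) columns.
temp_df_trimmed = temp_df.apply(
    lambda col: col.str.strip() if col.dtype == "object" else col
)

# Step 2: turn the bare "?" marker into a real missing value.
temp_df_trimmed["RC"] = temp_df_trimmed["RC"].map(
    lambda x: np.nan if x == "?" else x
)

rc_missing = temp_df_trimmed["RC"].isna().tolist()
print(rc_missing)
```

Stripping first means a single equality check against "?" is enough, and the numeric column passes through untouched.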

Source: https://habr.com/ru/post/984221/

