Python Pandas NaN Values for Dataframe

Question

I am trying to populate NaN values in a data frame with values coming from the standard normal distribution. This is currently my code:

  sqlStatement = "select * from sn.clustering_normalized_dataset"
  df = psql.frame_query(sqlStatement, cnx)
  data = df.pivot("user", "phrase", "tfw")
  dfrand = pd.DataFrame(data=np.random.randn(data.shape[0], data.shape[1]))
  data[np.isnan(data)] = dfrand[np.isnan(data)]

After the dataframe is pivoted, it looks like this:

  phrase     aaron  abbas  abdul       abe  able  abroad       abu  abuse \
  user
  14233664     NaN    NaN    NaN       NaN   NaN     NaN       NaN    NaN
  52602716     NaN    NaN    NaN       NaN   NaN     NaN       NaN    NaN
  123456789    NaN    NaN    NaN       NaN   NaN     NaN       NaN    NaN
  500158258    NaN    NaN    NaN       NaN   NaN     NaN       NaN    NaN
  517187571    0.4    NaN    NaN  0.142857     1     0.4  0.181818    NaN

However, I need each NaN value to be replaced with a new random value. So I created a new dataframe consisting only of random values (dfrand) and then tried to replace the missing values (NaN) with the values from dfrand at the corresponding positions. Unfortunately, this does not work. Although the expression

  np.isnan(data)

returns a dataframe consisting of True and False values, the expression

  dfrand[np.isnan(data)]

returns only NaN values, so the general trick doesn't work. Any ideas what the problem is?

python pandas random nan dataframe

user4045430 Dec 16 '14 at 14:33

2 answers

tnknepp · Answer 1 · 2014-12-16T15:21:43+0000

Three thousand columns is not that many. How many rows do you have? You can always create a random dataframe of the same size and do a boolean replacement (the size of your system will determine whether this is feasible).

If you know the size of your dataframe:

  import pandas as pd
  import numpy as np

  # create random dummy dataframe
  dfrand = pd.DataFrame(data=np.random.randn(rows,cols))

  # import "real" dataframe
  data = pd.read_csv(etc.) # or however you choose to read it in

  # replace nans
  data[np.isnan(data)] = dfrand[np.isnan(data)]

If you don't know the size of your dataframe, just reorder the steps:

  import pandas as pd
  import numpy as np

  # import "real" dataframe
  data = pd.read_csv(etc.) # or however you choose to read it in

  # create random dummy dataframe
  dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]))

  # replace nans
  data[np.isnan(data)] = dfrand[np.isnan(data)]
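One caveat for a pivoted frame like the one in the question: pandas aligns boolean masks and replacement frames by label, so the random frame likely needs the same index and columns as `data`. A minimal sketch, assuming a small made-up frame in place of the real pivot (the values and the seeded `rng` are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd

# Stand-in for the pivoted frame from the question (illustrative values only).
data = pd.DataFrame(
    {"aaron": [np.nan, 0.4], "abe": [np.nan, 0.142857], "abuse": [np.nan, np.nan]},
    index=pd.Index([14233664, 517187571], name="user"),
)

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Give the random frame the same row and column labels as `data`.
dfrand = pd.DataFrame(
    rng.standard_normal(data.shape), index=data.index, columns=data.columns
)

# Keep existing values; where `data` is NaN, take the value from `dfrand`
# that carries the same row and column label.
filled = data.where(data.notna(), dfrand)
```

Because both frames share labels, each NaN is replaced by the draw at the same position, and the existing values are left untouched.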

EDIT: Regarding the user's last comment: "dfrand[np.isnan(data)] returns only NaN".

Right! And that is exactly what you wanted. In my solution, I have: data[np.isnan(data)] = dfrand[np.isnan(data)]. Translated, this means: take the randomly generated value from dfrand at each position where "data" is NaN, and insert it into "data" at that position. An example will help:

  a = pd.DataFrame(data=np.random.randint(0,100,(10,3)))
  a[0][5] = np.nan

  In [32]: a
  Out[33]:
      0   1   2
  0   2  26  28
  1  14  79  82
  2  89  32  59
  3  65  47  31
  4  29  59  15
  5 NaN  58  90
  6  15  66  60
  7  10  19  96
  8  90  26  92
  9   0  19  23

  # define randomly-generated dataframe, much like what you are doing, and replace NaN's
  b = pd.DataFrame(data=np.random.randint(0,100,(10,3)))

  In [39]: b
  Out[39]:
      0   1   2
  0  92  21  55
  1  65  53  89
  2  54  98  97
  3  48  87  79
  4  98  38  62
  5  46  16  30
  6  95  39  70
  7  90  59   9
  8  14  85  37
  9  48  29  46

  a[np.isnan(a)] = b[np.isnan(a)]

  In [38]: a
  Out[38]:
      0   1   2
  0   2  26  28
  1  14  79  82
  2  89  32  59
  3  65  47  31
  4  29  59  15
  5  46  58  90
  6  15  66  60
  7  10  19  96
  8  90  26  92
  9   0  19  23

As you can see, every NaN in a was replaced by the randomly generated value at the same position in b.
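The example works because a and b both carry the default 0..n-1 row and column labels. A hedged sketch of what appears to happen in the question, where the pivoted frame is labeled by user and phrase while dfrand keeps default labels (the small frame below is a made-up stand-in), together with a positional alternative that works on the underlying NumPy arrays:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the pivoted frame: user ids as index, phrases as columns.
data = pd.DataFrame(
    {"aaron": [np.nan, 0.4], "abbas": [np.nan, np.nan]},
    index=pd.Index([14233664, 517187571], name="user"),
)
mask = data.isna()

# Default integer labels, as in the question: no label matches `data`.
dfrand = pd.DataFrame(np.random.randn(*data.shape))

# The mask is aligned to dfrand by label, so no cell is selected.
picked = dfrand[mask]
all_nan = bool(picked.isna().all().all())

# Positional alternative: plain arrays ignore labels entirely.
values = np.where(mask.to_numpy(), dfrand.to_numpy(), data.to_numpy())
filled = pd.DataFrame(values, index=data.index, columns=data.columns)
```

Here `all_nan` comes out true, which matches the "returns only NaN" symptom, while `filled` keeps the original labels and has no missing values.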

acushner · Answer 2 · 2014-12-16T14:52:10+0000

You can try something like this, assuming you are dealing with a single series:

  ser = data['column_with_nulls_to_replace']
  index = ser[ser.isnull()].index
  df = pd.DataFrame(np.random.randn(len(index)), index=index, columns=['column_with_nulls_to_replace'])
  ser.update(df)
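A runnable variant of the same idea, sketched for a single column. It assumes a small made-up frame and a hypothetical column name `tfw`. It passes a `Series` to `Series.update`, the input type that method documents, and writes the result back explicitly instead of relying on `ser` being a view of `data`:

```python
import numpy as np
import pandas as pd

# Made-up frame; "tfw" is a hypothetical column name used for illustration.
data = pd.DataFrame({"tfw": [0.4, np.nan, 1.0, np.nan]})

ser = data["tfw"].copy()
index = ser[ser.isnull()].index  # labels of the missing entries

# One standard-normal draw per missing entry, labeled with those positions.
fill = pd.Series(np.random.randn(len(index)), index=index)

ser.update(fill)    # overwrites only the labels present in `fill`
data["tfw"] = ser   # write the filled column back
```

Only the rows listed in `index` change; the existing values stay as they were.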
