Preserving NaN in pandas DataFrame inequalities

Question

I have a pandas.DataFrame object containing about 100 columns and 200,000 rows of data. I am trying to convert it to a bool DataFrame, where True means that the value is greater than or equal to the threshold, False means that it is less, and NaN values are preserved.

If there are no NaN values, it takes me about 60 ms:

df >= threshold

But when I try to handle NaNs, the method below works but is very slow (20 seconds):

 def func(x):
     if x >= threshold:
         return True
     elif x < threshold:
         return False
     else:
         return x  # NaN fails both comparisons and is returned unchanged

 df.apply(lambda x: x.apply(lambda x: func(x)))

Is there a faster way?

+5

python pandas

jsignell Nov 18 '15 at 14:44

3 answers

You can check for NaN separately, as described in this post: Python - find integer index of rows with NaN in pandas

 df.isnull()

Combine the isnull output with df >= threshold using bitwise or (the parentheses are required, because | binds more tightly than >=):

 df.isnull() | (df >= threshold)

Computing and combining the two masks will probably take closer to 200 ms, but that is still far enough below 20 seconds to be acceptable.
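
A minimal sketch of the combined mask on a small made-up frame. The data and the threshold value are assumptions for illustration only, not the asker's data. Note that this approach marks NaN positions as True rather than keeping them as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real frame is about 100 columns by 200,000 rows.
df = pd.DataFrame({"a": [0.5, np.nan, 2.0], "b": [np.nan, 1.0, 3.0]})
threshold = 1.0

is_missing = df.isnull()   # True where the value is NaN
passes = df >= threshold   # NaN compares as False here

# Parenthesize the comparison: `|` has higher precedence than `>=`.
combined = is_missing | (df >= threshold)
print(combined)
```

Keeping `is_missing` around as its own frame lets you tell afterwards which True entries came from NaN and which came from the comparison.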

+1

Mad physicist Nov 18 '15 at 2:51

In this situation, I use a float indicator array, encoded as: 0 = False, 1 = True and NaN = missing. A pandas DataFrame with a bool dtype cannot hold missing values, and a DataFrame with an object dtype containing a mix of Python bool and float objects is inefficient. That leads to using DataFrames with np.float64 dtype. numpy.sign(x - threshold) gives -1 for x < threshold, 0 for x == threshold and +1 for x > threshold, which may be good enough for your purposes, but if you really need 0/1, the conversion can be done in place. The timings below are for an array x of length 200K:

 In [45]: %timeit y = (x > 0); y[pd.isnull(x)] = np.nan
 100 loops, best of 3: 8.71 ms per loop

 In [46]: %timeit y = np.sign(x)
 100 loops, best of 3: 1.82 ms per loop

 In [47]: %timeit y = np.sign(x); y += 1; y /= 2
 100 loops, best of 3: 3.78 ms per loop
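
To make the encoding concrete, here is a small sketch on a made-up four-element array (the values and `threshold` are assumptions for illustration). It also shows that the in-place rescaling maps exact ties to 0.5, so a strict 0/1 encoding of `>=` needs a different step; the `np.where` line is one possible way to get it, not something from the timings above.

```python
import numpy as np

threshold = 1.0
x = np.array([0.5, 1.0, 2.0, np.nan])  # hypothetical sample values

# -1 below, 0 equal, +1 above the threshold; NaN propagates through np.sign.
s = np.sign(x - threshold)

# In-place rescale of the sign to the 0..1 range; ties end up as 0.5.
y = np.sign(x - threshold)
y += 1
y /= 2

# Alternative (an assumption, not from the answer): strict 0/1 for `>=`,
# with NaN restored at the missing positions.
ge = np.where(np.isnan(x), np.nan, (x >= threshold).astype(float))

print(s, y, ge)
```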

0

Kerby shedden Nov 28 '15 at 3:49

ocefpaf · Accepted Answer · 2015-11-19T14:21:50+0000

You can do:

 new_df = (df >= threshold).astype(float)  # cast first so the frame can hold NaN
 new_df[df.isnull()] = np.nan              # restore NaN at the missing positions

But this is different from what you get with the apply method. Here your mask has a float dtype containing NaN, 0.0 and 1.0. In the apply solution, you get an object dtype with NaN, False and True.

Either way, you should not use the result as a mask, because you may not get what you want. IEEE says that an ordered comparison with NaN (such as >= or <) should yield False, and the apply method implicitly violates this by returning NaN!

The best option is to keep track of the NaNs separately, and df.isnull() is quite fast when bottleneck is installed.
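
A rough sketch of what tracking the NaNs separately could look like, using a made-up frame and threshold (both are assumptions for illustration). The comparison result stays a plain bool frame, and a second bool frame records where the data was missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real frame.
df = pd.DataFrame({"a": [0.5, np.nan, 2.0], "b": [np.nan, 1.0, 3.0]})
threshold = 1.0

mask = df >= threshold   # plain bool frame; NaN entries compare as False
missing = df.isnull()    # separate bool frame marking the NaN positions

# Both frames are bool dtype, so they combine with the usual operators.
known_below = ~mask & ~missing   # below the threshold and not missing
print(known_below)
```

Because `mask` is a real bool frame, it can be used directly for indexing, and `missing` answers the question of whether a False means "below the threshold" or "no data".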
