Pandas: How to conditionally assign multiple columns?

I want to replace negative values with NaN only for specific columns. The easiest way:

 for col in ['a', 'b', 'c']:
     df.loc[df[col] < 0, col] = np.nan

df can have many columns, and I want to do this only for certain columns.

Is there a way to do this on a single line? It sounds like it should be easy, but I couldn't figure it out.

+5
6 answers

I don't think you will get much simpler than this:

 >>> df = pd.DataFrame({'a': np.arange(-5, 2), 'b': np.arange(-5, 2), 'c': np.arange(-5, 2), 'd': np.arange(-5, 2), 'e': np.arange(-5, 2)})
 >>> df
    a  b  c  d  e
 0 -5 -5 -5 -5 -5
 1 -4 -4 -4 -4 -4
 2 -3 -3 -3 -3 -3
 3 -2 -2 -2 -2 -2
 4 -1 -1 -1 -1 -1
 5  0  0  0  0  0
 6  1  1  1  1  1
 >>> cols = ['a', 'b', 'c']
 >>> df[df[cols] < 0] = np.nan
 >>> df
      a    b    c  d  e
 0  NaN  NaN  NaN -5 -5
 1  NaN  NaN  NaN -4 -4
 2  NaN  NaN  NaN -3 -3
 3  NaN  NaN  NaN -2 -2
 4  NaN  NaN  NaN -1 -1
 5  0.0  0.0  0.0  0  0
 6  1.0  1.0  1.0  1  1
+11

Use loc and where:

 cols = ['a', 'b', 'c']
 df.loc[:, cols] = df[cols].where(df[cols].ge(0), np.nan)

Demonstration:

 df = pd.DataFrame(np.random.randn(10, 5), columns=list('abcde'))
 df


 cols = list('abc')
 df.loc[:, cols] = df[cols].where(df[cols].ge(0), np.nan)
 df

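As a side note, DataFrame.mask is the mirror image of where: it replaces values where the condition is True, and its default replacement is NaN. Below is a minimal sketch on made-up float data (the frame and the names kept and masked are my own, not from the answer) showing that both spellings give the same result:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame; 'd' is deliberately left out of cols.
df = pd.DataFrame({
    'a': [-1.5, 0.0, 2.0],
    'b': [1.0, -2.0, 3.0],
    'c': [-0.5, 0.5, -3.0],
    'd': [-1.0, -2.0, -3.0],
})
cols = ['a', 'b', 'c']

# where keeps values where the condition is True, NaN elsewhere.
kept = df[cols].where(df[cols].ge(0))
# mask blanks values where the condition is True, keeps the rest.
masked = df[cols].mask(df[cols].lt(0))

# Write the cleaned columns back; 'd' is never touched.
df[cols] = masked
print(df)
```

Which one to use is mostly a matter of taste: mask reads closer to the original wording ("replace negative values"), while where states what is kept.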


You can speed it up with NumPy:

 df[cols] = np.where(df[cols] < 0, np.nan, df[cols])

This does the same thing.
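One detail worth knowing: np.where hands back a plain NumPy array rather than a DataFrame, so the result is written back to df[cols] by position, without any index or column alignment. That is likely where the speed-up comes from, though I have not verified it. A small sketch on made-up data (the frame and the name result are my own):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; 'd' is not part of cols.
df = pd.DataFrame({
    'a': [-1.0, 2.0],
    'b': [3.0, -4.0],
    'c': [5.0, 6.0],
    'd': [-7.0, 8.0],
})
cols = ['a', 'b', 'c']

# np.where(condition, value_if_true, value_if_false) returns an ndarray
# with one row per DataFrame row and one column per selected column.
result = np.where(df[cols] < 0, np.nan, df[cols])
print(type(result).__name__, result.shape)

# Assigning the array back fills the selected columns positionally.
df[cols] = result
print(df)
```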


Timing

 def gen_df(n):
     return pd.DataFrame(np.random.randn(n, 5), columns=list('abcde'))

Since assignment is an important part of this, I create df from scratch on every loop iteration. I also added the time to create df.

For n = 10000: [image: timing results]

For n = 100000: [image: timing results]
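A rough way to run this kind of comparison yourself is sketched below. Only gen_df comes from the answer; the helper names with_where and with_np_where, the repeat counts, and the timings dict are my own assumptions. Each helper builds a fresh frame before assigning, and gen_df is timed on its own as a baseline, so the numbers will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

cols = list('abc')

def gen_df(n):
    # Generator from the answer: n rows of standard-normal floats.
    return pd.DataFrame(np.random.randn(n, 5), columns=list('abcde'))

def with_where(n):
    # Hypothetical helper: fresh frame, then the loc/where assignment.
    df = gen_df(n)
    df.loc[:, cols] = df[cols].where(df[cols].ge(0), np.nan)
    return df

def with_np_where(n):
    # Hypothetical helper: fresh frame, then the np.where assignment.
    df = gen_df(n)
    df[cols] = np.where(df[cols] < 0, np.nan, df[cols])
    return df

timings = {}
for n in (10_000, 100_000):
    for func in (gen_df, with_where, with_np_where):
        # Best of 3 repeats of 5 calls each, stored as seconds per call.
        best = min(timeit.repeat(lambda: func(n), number=5, repeat=3))
        timings[(func.__name__, n)] = best / 5
        print(f"n={n:>7}  {func.__name__:<14} {best / 5 * 1e3:8.3f} ms")
```

Subtracting the gen_df baseline from the other two rows gives an estimate of the cost of the assignment alone.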

+6

Here is one way:

 df[df.columns.isin(['a', 'b', 'c']) & (df < 0)] = np.nan 
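To spell out why this works, as far as I understand it: df.columns.isin(...) returns a one-dimensional boolean array with one entry per column, and combining it with the boolean frame df < 0 applies that array across every row. The result is True only for negative cells in the chosen columns. A small sketch on made-up data (the frame and the names col_mask and mask are my own):

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 'd' should stay untouched.
df = pd.DataFrame({
    'a': [-1.0, 2.0],
    'b': [3.0, -4.0],
    'd': [-5.0, 6.0],
})

# One boolean per column: True for 'a' and 'b', False for 'd'.
col_mask = df.columns.isin(['a', 'b'])

# The 1-D column mask is matched against the columns of the boolean frame,
# so each row of (df < 0) is AND-ed with the same per-column flags.
mask = col_mask & (df < 0)
print(mask)

df[mask] = np.nan
print(df)
```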
+5

You can use np.where to achieve this:

 In [47]:
 df = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'))
 df

 Out[47]:
           a         b         c         d         e
 0  0.616829 -0.933365 -0.735308  0.665297 -1.333547
 1  0.069158  2.266290 -0.068686 -0.787980 -0.082090
 2  1.203311  1.661110 -1.227530 -1.625526  0.045932
 3 -0.247134 -1.134400  0.355436  0.787232 -0.474243
 4  0.131774  0.349103 -0.632660 -1.549563  1.196455

 In [48]:
 df[['a','b','c']] = np.where(df[['a','b','c']] < 0, np.nan, df[['a','b','c']])
 df

 Out[48]:
           a         b         c         d         e
 0  0.616829       NaN       NaN  0.665297 -1.333547
 1  0.069158  2.266290       NaN -0.787980 -0.082090
 2  1.203311  1.661110       NaN -1.625526  0.045932
 3       NaN       NaN  0.355436  0.787232 -0.474243
 4  0.131774  0.349103       NaN -1.549563  1.196455
+4

Sure, just select the columns you need from the mask:

 (df < 0)[['a', 'b', 'c']]

You can then use this mask in an assignment: df[(df < 0)[['a', 'b', 'c']]] = np.nan
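The reason a mask covering only some columns is enough, as far as I can tell, is that pandas aligns the boolean frame with df during the assignment and leaves any cell the mask does not mention unchanged. Here is a small sketch on made-up data (the frame and both mask names are my own) that also checks that slicing the full mask is the same as comparing the sliced frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 'd' holds a negative value that must survive.
df = pd.DataFrame({
    'a': [-1.0, 2.0],
    'b': [3.0, -4.0],
    'c': [-5.0, 6.0],
    'd': [-7.0, 8.0],
})
cols = ['a', 'b', 'c']

mask_from_full = (df < 0)[cols]   # compare everything, then pick columns
mask_from_subset = df[cols] < 0   # pick columns, then compare
print(mask_from_full.equals(mask_from_subset))

# The mask has no 'd' column, so 'd' is left as it is.
df[mask_from_full] = np.nan
print(df)
```

On a wide frame the second form does less work, since it only compares the columns you care about.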

+3

If it has to be a single line:

 df[['a', 'b', 'c']] = df[['a', 'b', 'c']].apply(lambda c: [x if x >= 0 else np.nan for x in c])
+1

Source: https://habr.com/ru/post/1258349/

