How to select rows and replace some columns in pandas

import numpy as np
import pandas as pd

dic = {'A': [np.nan, 4, np.nan, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]}
df = pd.DataFrame(dic)
df

Suppose I have data like the following:

     A  B   C
0   NaN 9   0
1   4.0 2   0
2   NaN 5   5
3   4.0 3   3

I want to select the rows where column A is NaN and replace the value of column B in those rows with np.nan, as follows:

    A   B   C
0   NaN NaN 0
1   4.0 2.0 0
2   NaN NaN 5
3   4.0 3.0 3

I tried df[df.A.isna()]["B"] = np.nan, but it didn't work.
According to this page, I should select the data with df.iloc. But the problem is that if df has many rows, I cannot select the data by typing row indices manually.
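
For context, a minimal sketch of why the chained-indexing attempt fails (same frame and column names as in the question): the boolean selection df[df.A.isna()] returns a copy, so assigning into that copy never touches the original frame.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 4, np.nan, 4],
                   'B': [9, 2, 5, 3],
                   'C': [0, 0, 5, 3]})

# df[df.A.isna()] builds a temporary copy, so the assignment below
# modifies that copy, not df itself (pandas warns about this).
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df[df.A.isna()]['B'] = np.nan

print(df.B.isna().sum())  # 0 -- df is unchanged
```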

3 answers

Option 1:
Actually, you were pretty close. Use pd.Series.isnull on A and assign values to B with df.loc:

df.loc[df.A.isnull(), 'B'] = np.nan
df

     A    B  C
0  NaN  NaN  0
1  4.0  2.0  0
2  NaN  NaN  5
3  4.0  3.0  3
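
Note that the boolean mask scales with the frame, so you never need to know row positions up front. A small sketch with a larger, randomly generated frame (column names follow the question; the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.choice([np.nan, 1.0], size=1000),
                   'B': rng.integers(0, 10, size=1000).astype(float)})

# the mask is computed from the data, not from hand-typed indices
df.loc[df.A.isnull(), 'B'] = np.nan

# every row where A is NaN now has B NaN as well, and no others
print((df.A.isnull() == df.B.isnull()).all())  # True
```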

Option 2:
Use np.where:

df['B'] = np.where(df.A.isnull(), np.nan, df.B)
df

     A    B  C
0  NaN  NaN  0
1  4.0  2.0  0
2  NaN  NaN  5
3  4.0  3.0  3

Use mask, which by default replaces the selected values with NaN, or where with the inverted condition:

df['B'] = df.B.mask(df.A.isnull())

df['B'] = df.B.where(df.A.notnull())

Or numpy.where:

df['B'] = np.where(df.A.isnull(), np.nan, df.B)

print (df)
     A    B  C
0  NaN  NaN  0
1  4.0  2.0  0
2  NaN  NaN  5
3  4.0  3.0  3
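
As a side note, mask also accepts an other argument, so the same pattern can substitute any sentinel instead of NaN. A minimal sketch, with -1 as an assumed placeholder value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 4, np.nan, 4],
                   'B': [9, 2, 5, 3]})

# mask(cond) replaces values where cond is True with NaN by default;
# other= lets you choose a different replacement (here -1).
df['B'] = df.B.mask(df.A.isnull(), other=-1)
print(df.B.tolist())  # [-1, 2, -1, 3]
```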

dic = {'A': [np.nan, 4, np.nan, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]}
df = pd.DataFrame(dic)
df = pd.concat([df] * 10000, ignore_index=True)


In [61]: %timeit df['B'] = np.where(df.A.isnull(), np.nan, df.B)
The slowest run took 7.57 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 405 µs per loop

In [62]: %timeit df['B'] = df.B.mask(df.A.isnull())
The slowest run took 70.14 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 3.54 ms per loop

In [63]: %timeit df['B'] = df.B.where(df.A.notnull())
The slowest run took 5.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.04 ms per loop

In [65]: %timeit df.B += df.A * 0
The slowest run took 12.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 913 µs per loop

In [67]: %timeit df.loc[df.A.isnull(), 'B'] = np.nan
The slowest run took 4.56 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop
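
Outside IPython, the comparison can be reproduced with the standard timeit module. A sketch for just the np.where variant (absolute numbers depend on the machine; B is made float here to avoid dtype-upcasting issues):

```python
import timeit

import numpy as np
import pandas as pd

dic = {'A': [np.nan, 4, np.nan, 4], 'B': [9., 2., 5., 3.], 'C': [0, 0, 5, 3]}
base = pd.concat([pd.DataFrame(dic)] * 10000, ignore_index=True)

def run_where():
    # work on a copy so every timed run starts from the same data
    df = base.copy()
    df['B'] = np.where(df.A.isnull(), np.nan, df.B)
    return df

print(timeit.timeit(run_where, number=100))  # total seconds for 100 runs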

Since my colleagues have already given the logical choices, here is a sillier one:

df.B += df.A * 0
df

     A    B  C
0  NaN  NaN  0
1  4.0  2.0  0
2  NaN  NaN  5
3  4.0  3.0  3
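
The trick works because NaN * 0 is NaN, so df.A * 0 is 0.0 where A has a value and NaN where it doesn't, and adding it to B nulls out exactly those rows. A quick check that it agrees with the df.loc approach (a sketch; B is made float here to avoid dtype-upcasting issues):

```python
import numpy as np
import pandas as pd

dic = {'A': [np.nan, 4, np.nan, 4], 'B': [9., 2., 5., 3.], 'C': [0, 0, 5, 3]}
a = pd.DataFrame(dic)
b = pd.DataFrame(dic)

a.B += a.A * 0                      # NaN * 0 is NaN, so NaN propagates into B
b.loc[b.A.isnull(), 'B'] = np.nan   # explicit boolean-mask assignment

print(a.B.equals(b.B))  # True
```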

Source: https://habr.com/ru/post/1693858/

