How to get the first pandas DataFrame index for which multiple undefined columns are not null?

Question

How to get the first pandas DataFrame index for which multiple undefined columns are not null?

I have a dataframe with multiple columns. I want to get the index of the first row for which:

column A is not null
there are at least n other columns for which the values are not null

example: if my data framework is:

          Date          A             B      C     D
0   2015-01-02          NaN           1      1    NaN
1   2015-01-02          NaN           2      2    NaN
2   2015-01-02          NaN           3      3    NaN
3   2015-01-02          1            NaN     4    NaN
5   2015-01-02          NaN           2      NaN  NaN
6   2015-01-03          1            NaN     6    NaN
7   2015-01-03          1             1      6    NaN
8   2015-01-03          1             1      6     8

If n = 1, I would get 3

If n = 2, I would get 7

If n = 3, I would get 8

+4

python numpy pandas dataframe multiple-columns

Arthurim Apr 20 '17 at 12:59

source share

2 answers

A, NaN, count loc, sum notnull 1 A.

idxmax:

a = df.loc[df['A'].notnull(), 'A':].notnull().sum(axis=1).sub(1)
print (a)
3    1
6    1
7    2
8    3
dtype: int64

N = 1
print ((a == N).idxmax())
3

N = 2
print ((a == N).idxmax())
7

N = 3
print ((a == N).idxmax())
8

print (df.loc[df['A'].notnull(), 'A':])
     A    B    C    D
3  1.0  NaN  4.0  NaN
6  1.0  NaN  6.0  NaN
7  1.0  1.0  6.0  NaN
8  1.0  1.0  6.0  8.0

+3

jezrael 20 . '17 13:07

Divakar · Accepted Answer · 2017-04-20T13:15:53+0000

Here is one approach to get these indices for different n'sat a time -

def numpy_approach(df, reference='A'):
    df0 = df.iloc[:,df.columns != 'Date']
    valid_mask = df0.columns != reference
    mask = ~np.isnan(df0.values)
    count = mask[:,valid_mask].sum(1) * mask[:,(~valid_mask).argmax()]
    idx0 = np.searchsorted(np.maximum.accumulate(count),[1,2,3])
    return df.index[idx0]

Run Examples -

In [555]: df
Out[555]: 
         Date    A    B    C    D
0  2015-01-02  NaN  1.0  1.0  NaN
1  2015-01-02  NaN  2.0  2.0  NaN
2  2015-01-02  NaN  3.0  3.0  NaN
3  2015-01-02  1.0  NaN  4.0  NaN
5  2015-01-02  NaN  2.0  NaN  NaN
6  2015-01-03  1.0  NaN  6.0  NaN
7  2015-01-03  1.0  1.0  6.0  NaN
8  2015-01-03  1.0  1.0  6.0  8.0

In [556]: numpy_approach(df, reference='A')
Out[556]: Int64Index([3, 7, 8], dtype='int64')

In [557]: numpy_approach(df, reference='B')
Out[557]: Int64Index([0, 7, 8], dtype='int64')

In [558]: numpy_approach(df, reference='C')
Out[558]: Int64Index([0, 7, 8], dtype='int64')

In [568]: numpy_approach(df, reference='D')
Out[568]: Int64Index([8, 8, 8], dtype='int64')

How to get the first pandas DataFrame index for which multiple undefined columns are not null?

More articles: