Pandas - Find the longest stretch without Nan values

Question

I have a pandas dataframe "df", a sample of which is below:

   time  x
0  1     1
1  2     NaN
2  3     3
3  4     NaN
4  5     8
5  6     7
6  7     5
7  8     NaN

The real frame is much larger. I am trying to find the longest stretch of non-NaN values in the "x" series and print the start and end indices for this frame. Is it possible?

Thanks.

+4

python numpy pandas

Jeff saltfist Jan 05 '17 at 20:48

5 answers

First, find the positions of the NaN values:

import numpy as np

index = df['x'].index[df['x'].apply(np.isnan)]
df_index = df.index.values.tolist()
[df_index.index(indexValue) for indexValue in index]

[1, 3, 7]

From these positions you can then work out the longest stretch between consecutive NaNs.

+4

Greg Lever Jan 05 '17 at 20:55

pandas

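import pandas as pd

f = dict(
    Start=pd.Series.first_valid_index,
    Stop=pd.Series.last_valid_index,
    Stretch='count'
)

# group on the running count of nulls; pandas >= 0.25 takes the
# name -> function mapping as keyword arguments
agged = df.x.groupby(df.x.isnull().cumsum()).agg(**f)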
agged.loc[agged.Stretch.idxmax(), ['Start', 'Stop']].values

array([ 4.,  6.])
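To see why grouping on the running count of nulls works, it can help to print the intermediate values. The following is only a sketch built on the sample frame from the question; the variable names (keys, counts, best, block) are illustrative and not part of the answer above, and it assumes the column has at least one valid value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': range(1, 9),
                   'x': [1, np.nan, 3, np.nan, 8, 7, 5, np.nan]})

# Each NaN bumps the running total, so every stretch gets its own key.
keys = df['x'].isnull().cumsum()

# count() ignores NaN, so this is the number of valid values per key.
counts = df['x'].groupby(keys).count()

# Key of the longest stretch, then its valid rows only.
best = counts.idxmax()
block = df['x'][keys == best].dropna()
start, stop = block.index[0], block.index[-1]

print(keys.tolist())    # [0, 1, 1, 2, 2, 2, 2, 3]
print(counts.tolist())  # [1, 1, 3, 0]
print(start, stop)      # 4 6
```

Every group except the first begins with the NaN that opened it, which is why the valid rows are recovered with dropna() rather than by taking the whole group.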

numpy

import numpy as np

def pir(x):
    # pad with np.nan
    x = np.append(np.nan, np.append(x, np.nan))
    # find where null
    w = np.where(np.isnan(x))[0]
    # diff to find length of stretch
    # argmax to find where largest stretch
    a = np.diff(w).argmax()
    # shift the two bounding nulls to the first and last valid
    # positions in the original (unpadded) array
    return w[[a, a + 1]] + np.array([0, -2])

pir(df.x.values)

array([4, 6])

a = np.array([1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, 8, 7, 5, 2, 5, np.nan, np.nan])
pir(a)

array([ 7, 11])

+3

piRSquared Jan 05 '17 at 21:15

You can get the index of the NaN values like this (assuming pandas and numpy are imported as pd and np):

In [19]: df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8],'x':[1,np.nan,3,np.nan,8,7,5,np.nan]})

In [20]: index = df['x'].isnull()

In [21]: df[index].index.values
Out[21]: array([1, 3, 7])

+1

dleal Jan 05 '17 at 21:03

One option is scipy.ndimage.measurements.label (also available as scipy.ndimage.label), which labels each contiguous stretch of valid rows.

import pandas as pd
import numpy as np
from scipy.ndimage import label
df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8],'x':[1,np.nan,3,np.nan,8,7,5,np.nan]})

Getting the longest stretch without nan

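valid_rows = ~df.isnull().any(axis=1)
# keep the result under a new name so the label() function is not shadowed
labels, num_feature = label(valid_rows)
# idxmax() returns the label of the largest group (argmax() is positional in current pandas)
label_of_biggest_group = valid_rows.groupby(labels).count().drop(0).idxmax()
print(df.loc[labels == label_of_biggest_group])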

Result

   time    x
4     5  8.0
5     6  7.0
6     7  5.0

Note

Label 0 is the background, which in our case means the rows containing NaN. It has to be dropped, because otherwise it would be selected whenever the number of NaN rows is greater than or equal to the size of the largest group. num_feature is the number of contiguous stretches without NaN.
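As a rough illustration of the note above, here is what the labelling looks like on the validity mask of the sample frame. This is a sketch, not part of the original answer: the mask is typed in by hand, and np.bincount is swapped in for the groupby/count step to measure the group sizes.

```python
import numpy as np
from scipy.ndimage import label

# Validity mask of the sample frame: True where the row has no NaN.
valid = np.array([True, False, True, False, True, True, True, False])

labels, num_features = label(valid)
print(labels.tolist())  # [1, 0, 2, 0, 3, 3, 3, 0]
print(num_features)     # 3

# sizes[0] counts the background (NaN rows), so skip it before argmax.
sizes = np.bincount(labels)
biggest = sizes[1:].argmax() + 1

positions = np.flatnonzero(labels == biggest)
start, end = positions[0], positions[-1]
print(start, end)       # 4 6
```

Here the background has 3 rows, the same size as the largest group, which is exactly the case where leaving label 0 in could return the wrong rows.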

+1

Delforge Nov 02 '17 at 15:54

Divakar · Accepted Answer · 2017-01-05T21:06:46+0000

Here's a vectorized approach using NumPy tools -

import numpy as np

a = df.x.values  # Extract out relevant column from dataframe as array
m = np.concatenate(( [True], np.isnan(a), [True] ))  # Mask
ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1,2)   # Start-stop limits
start,stop = ss[(ss[:,1] - ss[:,0]).argmax()]  # Get max interval, interval limits

Sample run -

In [474]: a
Out[474]: 
array([  1.,  nan,   3.,  nan,  nan,  nan,  nan,   8.,   7.,   5.,   2.,
         5.,  nan,  nan])

In [475]: start, stop
Out[475]: (7, 12)

The intervals are set such that the difference between each start and stop gives the length of that interval. So, if by ending index you mean the index of the last non-NaN element, subtract one from stop.
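Putting this together for the frame in the question, a minimal sketch might look like the following. The helper name longest_valid_stretch is hypothetical, and the choices to return index labels with an inclusive end, and None when there are no valid values, are assumptions added here rather than part of the answer.

```python
import numpy as np
import pandas as pd

def longest_valid_stretch(series):
    """Return (first_label, last_label) of the longest run of non-NaN
    values, or None if the series has no valid values."""
    a = series.to_numpy(dtype=float)
    # Pad the NaN mask with True so every valid run has two transitions.
    m = np.concatenate(([True], np.isnan(a), [True]))
    # Each row is (start, stop) with stop exclusive.
    ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1, 2)
    if len(ss) == 0:
        return None
    start, stop = ss[(ss[:, 1] - ss[:, 0]).argmax()]
    # Subtract one from stop to get the last valid position.
    return series.index[start], series.index[stop - 1]

df = pd.DataFrame({'time': range(1, 9),
                   'x': [1, np.nan, 3, np.nan, 8, 7, 5, np.nan]})
first, last = longest_valid_stretch(df['x'])
print(first, last)  # 4 6
```

When several stretches tie for the longest, argmax picks the first one.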
