Pandas - Find the longest stretch without NaN values

I have a pandas dataframe "df", a sample of which is below:

   time  x
0  1     1
1  2     NaN
2  3     3
3  4     NaN
4  5     8
5  6     7
6  7     5
7  8     NaN

The real frame is much larger. I am trying to find the longest stretch of non-NaN values in the "x" series and print the start and end indices of that stretch. Is this possible?

Thanks.

+4
5 answers

Here's a vectorized approach using NumPy tools:

import numpy as np

a = df.x.values  # Extract the relevant column from the dataframe as an array
m = np.concatenate(( [True], np.isnan(a), [True] ))  # NaN mask, padded so edge runs are closed
ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1,2)   # Start-stop limits of each non-NaN run
start,stop = ss[(ss[:,1] - ss[:,0]).argmax()]  # Limits of the longest run

Sample run:

In [474]: a
Out[474]: 
array([  1.,  nan,   3.,  nan,  nan,  nan,  nan,   8.,   7.,   5.,   2.,
         5.,  nan,  nan])

In [475]: start, stop
Out[475]: (7, 12)

The intervals are set up so that the difference between each start and stop gives the length of that interval, which means stop is exclusive. So, if by the ending index you mean the index of the last non-NaN element, subtract one from stop.
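To make the exclusive stop concrete, here is a minimal sketch that wraps the same steps in a helper and returns inclusive positions. The name longest_valid_run is hypothetical (it is not part of the answer), and returning None for an all-NaN input is an assumption about what the caller wants.

```python
import numpy as np

def longest_valid_run(a):
    """Return (first, last) positions, both inclusive, of the longest non-NaN run.

    Hypothetical helper; returns None when there are no valid values (an assumption).
    """
    a = np.asarray(a, dtype=float)
    # Pad with True so runs touching either edge still get a start and a stop
    m = np.concatenate(([True], np.isnan(a), [True]))
    # Changes in the mask alternate between run starts and run stops
    ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1, 2)
    if len(ss) == 0:
        return None
    start, stop = ss[(ss[:, 1] - ss[:, 0]).argmax()]
    return int(start), int(stop) - 1  # stop is exclusive, so step back by one

a = [1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, 8, 7, 5, 2, 5, np.nan, np.nan]
print(longest_valid_run(a))  # (7, 11)
```

These are positions, not index labels. For a frame whose index is not 0..n-1, df.index[[first, last]] should map them back to labels.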

+3

You can first find the positions of the NaN values:

import numpy as np

index = df['x'].index[df['x'].apply(np.isnan)]
df_index = df.index.values.tolist()
[df_index.index(indexValue) for indexValue in index]

>>> [1, 3, 7]

From these positions you can then work out the longest stretch without NaN.
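The answer stops at the NaN positions, so the following is only a sketch of one way to finish the job, not the answerer's own code. The helper name longest_gap and its parameters are hypothetical; it assumes the positions are sorted and that n is the number of rows.

```python
import numpy as np

def longest_gap(nan_positions, n):
    """Longest run of valid rows, given sorted NaN positions and the row count n.

    Hypothetical helper; returns inclusive (first, last) positions, or None
    when there is no valid row at all.
    """
    # Sentinels at -1 and n behave like NaNs sitting just outside the data
    bounds = np.concatenate(([-1], np.asarray(nan_positions, dtype=int), [n]))
    gaps = np.diff(bounds) - 1  # number of valid rows between neighbouring NaNs
    i = int(gaps.argmax())
    if gaps[i] == 0:
        return None
    return int(bounds[i]) + 1, int(bounds[i + 1]) - 1

# NaN positions of the question's sample frame, which has 8 rows
print(longest_gap([1, 3, 7], 8))  # (4, 6)
```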

+4

pandas

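import pandas as pd

f = dict(
    Start=pd.Series.first_valid_index,
    Stop=pd.Series.last_valid_index,
    Stretch='count'
)

# Each NaN starts a new group, so every group holds one run of valid values.
# Keyword (named) aggregation is used because recent pandas versions
# no longer accept a renaming dict in SeriesGroupBy.agg.
agged = df.x.groupby(df.x.isnull().cumsum()).agg(**f)
agged.loc[agged.Stretch.idxmax(), ['Start', 'Stop']].values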

array([ 4.,  6.])

numpy

def pir(x):
    # pad with np.nan
    x = np.append(np.nan, np.append(x, np.nan))
    # find where null
    w = np.where(np.isnan(x))[0]
    # diff to find length of stretch
    # argmax to find where largest stretch
    a = np.diff(w).argmax()
    # shift the boundary nulls inward to get the first and last valid positions
    return w[[a, a + 1]] + np.array([0, -2])

pir(df.x.values)

array([4, 6])

a = np.array([1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, 8, 7, 5, 2, 5, np.nan, np.nan])
pir(a)

array([ 7, 11])
+3

You can also get the indices of the NaN values with isnull:

In [19]: df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8],'x':[1,np.nan,3,np.nan,8,7,5,np.nan]})

In [20]: index = df['x'].isnull()

In [21]: df[index].index.values
Out[21]: array([1, 3, 7])
+1

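Another option is scipy.ndimage.measurements.label (available as scipy.ndimage.label in current SciPy), which labels each contiguous run of valid rows.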

import pandas as pd
import numpy as np
from scipy.ndimage import label  # the scipy.ndimage.measurements namespace is deprecated
df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8],'x':[1,np.nan,3,np.nan,8,7,5,np.nan]})

Getting the longest stretch without nan

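valid_rows = ~df.isnull().any(axis=1)
# Use a new name so the label() function is not shadowed
labels, num_feature = label(valid_rows.to_numpy())
# idxmax returns the label itself; argmax would return a position
label_of_biggest_group = valid_rows.groupby(labels).count().drop(0).idxmax()
print(df.loc[labels == label_of_biggest_group])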

Result

   time    x
4     5  8.0
5     6  7.0
6     7  5.0

Note

Label 0 is the background, which in our case is the rows containing NaN. It has to be dropped because the number of NaN rows may be greater than or equal to the size of the largest group. num_feature is the number of contiguous stretches without NaN.

+1

Source: https://habr.com/ru/post/1665928/

