Determine the average value of 'data' where the number of CONTINUOUS cond = True is at a maximum

I have a pandas DataFrame with columns 'data' and 'cond' (condition). I need the average value (of the data column) over the rows of the longest CONTINUOUS run of True values in 'cond'.

    Example DataFrame:

        cond  data
    0   True  0.20
    1  False  0.30
    2   True  0.90
    3   True  1.20
    4   True  2.30
    5  False  0.75
    6   True  0.80

    Result = 1.466, which is the mean of row indexes 2:4, the longest run (3 consecutive True values)

I was not able to find a "vectorized" solution using groupby or pivot, so I wrote a function that loops over the rows. Unfortunately, this takes about an hour per million rows, which is far too long. Adding the numba @jit decorator did not reduce the run time either.

The data come from a one-year monitoring project: one DataFrame of about a million rows every 3 hours, so roughly 3,000 such files to analyze.
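The looping function from the question is not shown; as a reference point, one plausible shape of such a row-by-row approach (a sketch, not the original code) is:

```python
import pandas as pd

def loop_based(df):
    # Scan row by row, tracking the current and best (longest) run of True
    best_len, best_start = 0, 0
    cur_len, cur_start = 0, 0
    for i, flag in enumerate(df['cond']):
        if flag:
            if cur_len == 0:
                cur_start = i
            cur_len += 1
            if cur_len > best_len:
                best_len, best_start = cur_len, cur_start
        else:
            cur_len = 0
    # Mean of 'data' over the longest True run
    return df['data'].iloc[best_start:best_start + best_len].mean()

# Example frame from the question
df = pd.DataFrame({'cond': [True, False, True, True, True, False, True],
                   'data': [0.20, 0.30, 0.90, 1.20, 2.30, 0.75, 0.80]})
print(loop_based(df))  # 1.4666...
```

A per-row Python loop of this kind is exactly what the vectorized answers avoid.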

Here's a NumPy-based approach -

import numpy as np

# Extract the relevant cond column as a 1D NumPy array and pad with False at
# both ends, so that later we can find the start (rising edge)
# and stop (falling edge) of each interval of True values
arr = np.concatenate(([False],df.cond.values,[False]))

# Determine the rising and falling edges as start and stop 
start = np.nonzero(arr[1:] > arr[:-1])[0]
stop = np.nonzero(arr[1:] < arr[:-1])[0]

# Get the interval lengths and determine the largest interval ID
maxID = (stop - start).argmax()

# With maxID get max interval range and thus get mean on the second col
out = df.data.iloc[start[maxID]:stop[maxID]].mean()

Approaches as functions for timing purposes -

def pandas_based(df): # @ayhan soln
    res = df['data'].groupby((df['cond'] != df['cond'].shift()).\
                                cumsum()).agg(['count', 'mean'])
    return res[res['count'] == res['count'].max()]

def numpy_based(df):
    arr = np.concatenate(([False],df.cond.values,[False]))
    start = np.nonzero(arr[1:] > arr[:-1])[0]
    stop = np.nonzero(arr[1:] < arr[:-1])[0]
    maxID = (stop - start).argmax()
    return df.data.iloc[start[maxID]:stop[maxID]].mean()
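As a quick sanity check (my addition, not part of the original answer), both functions agree on the example frame from the question:

```python
import numpy as np
import pandas as pd

def pandas_based(df):
    # Group runs of identical cond values, then keep the longest group(s)
    res = df['data'].groupby((df['cond'] != df['cond'].shift()).cumsum()).agg(['count', 'mean'])
    return res[res['count'] == res['count'].max()]

def numpy_based(df):
    # Pad with False, find rising/falling edges, take the longest interval
    arr = np.concatenate(([False], df.cond.values, [False]))
    start = np.nonzero(arr[1:] > arr[:-1])[0]
    stop = np.nonzero(arr[1:] < arr[:-1])[0]
    maxID = (stop - start).argmax()
    return df.data.iloc[start[maxID]:stop[maxID]].mean()

df = pd.DataFrame({'cond': [True, False, True, True, True, False, True],
                   'data': [0.20, 0.30, 0.90, 1.20, 2.30, 0.75, 0.80]})
print(numpy_based(df))                   # 1.4666...
print(pandas_based(df)['mean'].iloc[0])  # same value
```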

Timings -

In [208]: # Setup dataframe
     ...: N = 1000  # Datasize
     ...: df = pd.DataFrame(np.random.rand(N),columns=['data'])
     ...: df['cond'] = np.random.rand(N)>0.3 # To have 70% True values
     ...: 

In [209]: %timeit pandas_based(df)
100 loops, best of 3: 2.61 ms per loop

In [210]: %timeit numpy_based(df)
1000 loops, best of 3: 215 µs per loop

In [211]: # Setup dataframe
     ...: N = 10000  # Datasize
     ...: df = pd.DataFrame(np.random.rand(N),columns=['data'])
     ...: df['cond'] = np.random.rand(N)>0.3 # To have 70% True values
     ...: 

In [212]: %timeit pandas_based(df)
100 loops, best of 3: 4.12 ms per loop

In [213]: %timeit numpy_based(df)
1000 loops, best of 3: 331 µs per loop

With pandas:

df['data'].groupby((df['cond'] != df['cond'].shift()).cumsum()).agg(['count', 'mean'])[lambda x: x['count']==x['count'].max()]
Out: 
      count      mean
cond                 
3         3  1.466667

Indexing with a callable requires pandas 0.18.0. On earlier versions:

res = df['data'].groupby((df['cond'] != df['cond'].shift()).cumsum()).agg(['count', 'mean'])

res[res['count'] == res['count'].max()]
Out: 
      count      mean
cond                 
3         3  1.466667
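If only the scalar mean is needed rather than the one-row result frame, idxmax on the count column extracts it directly (my addition to the answer):

```python
import pandas as pd

df = pd.DataFrame({'cond': [True, False, True, True, True, False, True],
                   'data': [0.20, 0.30, 0.90, 1.20, 2.30, 0.75, 0.80]})

res = df['data'].groupby((df['cond'] != df['cond'].shift()).cumsum()).agg(['count', 'mean'])
# idxmax returns the group label of the longest run; .loc picks out its mean
print(res.loc[res['count'].idxmax(), 'mean'])  # 1.4666...
```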

Explanation:

First, df['cond'] != df['cond'].shift() compares each row's cond with the value in the previous row:

df['cond'] != df['cond'].shift()
Out: 
0     True
1     True
2     True
3    False
4    False
5     True
6     True
Name: cond, dtype: bool

It returns False when the current value is the same as the previous one, and True when it changes (the first row is always True, since there is nothing to compare it against). Taking the cumulative sum assigns a group number to each run of identical values:

(df['cond'] != df['cond'].shift()).cumsum()
Out: 
0    1
1    2
2    3
3    3
4    3
5    4
6    5
Name: cond, dtype: int32

Now groupby can group on this series (each run of consecutive identical values gets its own number), and .agg(['count', 'mean']) returns the length and the mean of each group. Finally, we select the group(s) whose count equals the maximum.

Note that this also counts runs of consecutive False values. If only runs of True should be considered, the grouping key can be changed to:

((df['cond'] != df['cond'].shift()) | (df['cond'] != True)).cumsum()

This starts a new group whenever the value changes OR the value is not True, so consecutive False rows can never merge into one large group. In one line:

df['data'].groupby(((df['cond'] != df['cond'].shift()) | (df['cond'] != True)).cumsum()).agg(['count', 'mean'])[lambda x: x['count']==x['count'].max()]
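To see why the extra term matters, consider a frame whose longest run consists of False values (an assumed example, not from the original post):

```python
import pandas as pd

df = pd.DataFrame({'cond': [False, False, False, True, True],
                   'data': [1.0, 2.0, 3.0, 10.0, 20.0]})

# Plain key: the run of three False rows wins
key = (df['cond'] != df['cond'].shift()).cumsum()
res = df['data'].groupby(key).agg(['count', 'mean'])
print(res[res['count'] == res['count'].max()]['mean'].iloc[0])  # 2.0 (mean of the False run)

# Modified key: every False row becomes its own group, so only True runs can win
key = ((df['cond'] != df['cond'].shift()) | (df['cond'] != True)).cumsum()
res = df['data'].groupby(key).agg(['count', 'mean'])
print(res[res['count'] == res['count'].max()]['mean'].iloc[0])  # 15.0 (mean of the True run)
```

With the modified key no False group can ever have a count above 1, so the longest True run is always selected.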

Source: https://habr.com/ru/post/1657932/

