How can I present the information in this DataFrame as a time series?

I have a pandas DataFrame that looks like this:

             start_time             end_time    user
0  2016-12-17 03:10:07   2016-12-17 03:18:10  andrew
1  2016-12-17 03:11:07   2016-12-17 03:15:07   eddie
2  2016-12-17 03:12:08   2016-12-17 03:19:08  andrew  
3  2016-12-17 03:13:08   2016-12-17 03:14:06   eddie
...

Each row represents a job that was sent to a compute cluster. start_time is when the scheduled job starts, and end_time is when it completes.

How can I create a new time-indexed DataFrame that describes how many jobs each user is currently executing?

2 answers

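Here is one way to do it. Add a column of ones that marks each row as an event (called event here). Then build two frames, one indexed by start time and one indexed by end time, each with one column per user.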

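Pivoting puts a 1 where a user has an event and 0 elsewhere. After resampling to one-second bins, .min().fillna(0) leaves the frame with NAs for the seconds that have no events and then replaces them with 0.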

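df['event'] = 1
# Recent pandas versions require keyword arguments in pivot(); start_time and end_time must be datetimes
df_starts = df.pivot(index='start_time', columns='user', values='event').fillna(0).resample('1s').min().fillna(0)
df_stops = df.pivot(index='end_time', columns='user', values='event').fillna(0).resample('1s').min().fillna(0)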
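One thing to watch for in the pivot step: DataFrame.pivot raises a ValueError when the same user has two jobs with the same timestamp, and min() cannot report more than one event per second. Below is a minimal sketch of a counting variant, assuming pivot_table with aggfunc="sum" is an acceptable substitute for pivot. The sample rows are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical data: andrew starts two jobs in the same second.
df = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2016-12-17 03:10:07",
        "2016-12-17 03:10:07",
        "2016-12-17 03:10:09",
    ]),
    "user": ["andrew", "andrew", "eddie"],
})
df["event"] = 1

# pivot_table aggregates duplicate (time, user) pairs instead of raising.
starts = df.pivot_table(index="start_time", columns="user",
                        values="event", aggfunc="sum", fill_value=0)

# Resample to a one-second grid; sum() gives 0 for seconds with no events.
starts = starts.resample(pd.Timedelta(seconds=1)).sum()
print(starts)
```

With counts instead of 0/1 flags, the later subtraction and cumulative sum still work unchanged.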

The two frames generally cover different time ranges, so give them a common index.

full_index = df_starts.index.union(df_stops.index)

df_starts = df_starts.reindex(full_index, fill_value=0)
df_stops = df_stops.reindex(full_index, fill_value=0)
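Note that the union of two resampled indexes is only gap-free when the two ranges touch or overlap. If the last start is more than a second before the first end, the seconds in between are missing from the union. The cumulative sum is unaffected, but lookups at those timestamps would fail. A small sketch of the difference, assuming a contiguous grid built with pd.date_range is wanted; the two index ranges are hypothetical:

```python
import pandas as pd

one_second = pd.Timedelta(seconds=1)

# Hypothetical resampled indexes with a two-second gap between them.
starts_index = pd.date_range("2016-12-17 03:10:07", "2016-12-17 03:10:09", freq=one_second)
stops_index = pd.date_range("2016-12-17 03:10:12", "2016-12-17 03:10:13", freq=one_second)

# The union keeps only timestamps present in either index.
union_index = starts_index.union(stops_index)

# A contiguous alternative spanning first start to last stop.
full_index = pd.date_range(
    min(starts_index.min(), stops_index.min()),
    max(starts_index.max(), stops_index.max()),
    freq=one_second,
)

print(len(union_index), len(full_index))
```

Either index can be passed to reindex(..., fill_value=0) in the same way.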

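Now subtract the stops from the starts. Each second then holds +1 for a job that started and -1 for a job that ended. Taking .cumsum() of that change gives the number of jobs running at each moment.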

df_change = df_starts - df_stops
df_running = df_change.cumsum()
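To check the whole pipeline, here is a sketch that runs these steps on the four rows shown in the question. It assumes the time columns are parsed with pd.to_datetime and passes a pd.Timedelta as the resample rule instead of a frequency string; the variable names are illustrative.

```python
import pandas as pd

# The four sample rows from the question.
df = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2016-12-17 03:10:07", "2016-12-17 03:11:07",
        "2016-12-17 03:12:08", "2016-12-17 03:13:08",
    ]),
    "end_time": pd.to_datetime([
        "2016-12-17 03:18:10", "2016-12-17 03:15:07",
        "2016-12-17 03:19:08", "2016-12-17 03:14:06",
    ]),
    "user": ["andrew", "eddie", "andrew", "eddie"],
})
df["event"] = 1
one_second = pd.Timedelta(seconds=1)

# One column per user, 1 where a job starts (or stops), 0 elsewhere.
starts = (df.pivot(index="start_time", columns="user", values="event")
            .fillna(0).resample(one_second).min().fillna(0))
stops = (df.pivot(index="end_time", columns="user", values="event")
           .fillna(0).resample(one_second).min().fillna(0))

# Align both frames on a shared index.
full_index = starts.index.union(stops.index)
starts = starts.reindex(full_index, fill_value=0)
stops = stops.reindex(full_index, fill_value=0)

# +1 per start, -1 per stop, accumulated over time.
running = (starts - stops).cumsum()
print(running.loc[pd.Timestamp("2016-12-17 03:13:08")])
```

At 03:13:08 all four jobs have started and none has finished, so both users show two running jobs; the last row is back to zero for both.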

Here is a plot of df_running, with time on the x-axis:

[Figure: df_running plotted over time]

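Here is another approach. Stack the start and end times into a single time column, mark each start as +1 and each end as -1, then sum per time and user and take the cumulative sum.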

import numpy as np
import pandas as pd
import datetime as dt

#Generate some data
m = 50
n = 2 * m

start_time = [dt.datetime(2016, 12, 17, 3, np.random.randint(0, 59)) for _ in range(n)]

df = pd.DataFrame({'start_time': start_time,
                   'end_time': [date + dt.timedelta(0, np.random.randint(0, 3600)) for date in start_time],
                   'user': ['A', 'E'] * (m)})

# Build the solution: each start counts +1, each end counts -1
user_on = (df.loc[:, ['start_time', 'user']]
             .rename(columns={'start_time': 'time'})
             .assign(on_off=1))
user_off = (df.loc[:, ['end_time', 'user']]
              .rename(columns={'end_time': 'time'})
              .assign(on_off=-1))

df = pd.concat([user_on, user_off]).sort_values(by='time')
df = df.groupby(['time', 'user']).sum()
# fill_value=0 means "no change for this user at this time", so cumsum needs no forward fill
df = df.unstack(fill_value=0).cumsum()

This produces a frame like this:

                    on_off
              user  A   E
time        
2016-12-17 03:00:00 1   0
2016-12-17 03:01:00 2   1
2016-12-17 03:02:00 2   2
2016-12-17 03:03:00 4   4
2016-12-17 03:04:00 5   3
2016-12-17 03:06:00 7   4
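The result above is indexed by the event times themselves, so the rows are irregularly spaced. If a regular grid is wanted, one option is to resample the running counts and carry the last known value forward. The sketch below runs on the four sample rows from the question; it uses melt rather than concat to stack the two time columns, and the one-minute grid and the names events, change, running and per_minute are assumptions for illustration.

```python
import pandas as pd

# The four sample rows from the question.
df = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2016-12-17 03:10:07", "2016-12-17 03:11:07",
        "2016-12-17 03:12:08", "2016-12-17 03:13:08",
    ]),
    "end_time": pd.to_datetime([
        "2016-12-17 03:18:10", "2016-12-17 03:15:07",
        "2016-12-17 03:19:08", "2016-12-17 03:14:06",
    ]),
    "user": ["andrew", "eddie", "andrew", "eddie"],
})

# Stack start and end times into one "time" column, one row per event.
events = df.melt(id_vars="user", value_vars=["start_time", "end_time"],
                 var_name="kind", value_name="time")
events["on_off"] = events["kind"].map({"start_time": 1, "end_time": -1})

# Net change per (time, user), then running total per user.
change = events.groupby(["time", "user"])["on_off"].sum().unstack(fill_value=0)
running = change.cumsum()

# One row per minute: the last count seen in that minute,
# carried forward through minutes that have no events.
per_minute = running.resample(pd.Timedelta(minutes=1)).last().ffill()
print(per_minute)
```

Each row is labelled with the start of its minute and holds the count as of the last event in that minute, so short jobs that start and finish inside one minute do not show up at this resolution.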

82 , 10 000 , .

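For a single user, you can plot the cumulative sum (the number of jobs running) together with the change at each timestamp (ons-offs).

[Figure: Cusum vs Change in Programs running for User]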

.. , . , .

Source: https://habr.com/ru/post/1664072/

