Pandas DataFrame - Get hours over duration

I am a relative newbie to pandas and I am not sure how to approach this. I analyze the flow of tickets through a help desk system. The raw data looks like this (with a large number of columns, and with ClosedAt sometimes falling on a later day):

       TicketNo SvcGroup           CreatedAt                   ClosedAt
    0   4237941     Unix 2013-07-28 03:55:00 2013-07-28 11:01:37.346438
    1   4238041  Windows 2013-07-28 04:59:00 2013-07-28 18:25:02.193182
    2   4238051  Windows 2013-07-28 05:09:00 2013-07-28 23:11:12.003673
    3   4238291  Windows 2013-07-28 05:10:00 2013-07-28 05:32:51.547251
    4   4238321     Unix 2013-07-28 01:15:00        2013-07-28 10:09:20
    5   4238331     Unix 2013-07-28 01:53:00 2013-07-28 17:42:56.192088
    6   4238561  Windows 2013-07-28 02:03:00 2013-07-28 06:34:09.455042
    7   4238691  Windows 2013-07-28 02:03:00 2013-07-28 20:54:47.306731
    8   4238811  Windows 2013-07-28 03:23:00 2013-07-28 13:15:20.823505
    9   4238851  Windows 2013-07-28 04:16:00 2013-07-28 23:51:55.561463
    10  4239011     Unix 2013-07-28 04:26:00 2013-07-28 09:27:06.275342
    11  4239041  Windows 2013-07-28 04:38:00 2013-07-28 07:55:34.416621
    12  4239131     Unix 2013-07-28 08:15:00 2013-07-28 08:46:42.380739
    13  4239141  Windows 2013-07-28 01:08:00 2013-07-28 15:37:12.266341

I want to look at the data by hour of day, to see how tickets move through the help desk over a shift, so an intermediate step could be something like this:

                           Opened  Open  Closed  CarryFwd
    TicketNo SvcGroup Hour
    4237941  Unix     3         1     1       0         1
                      4         0     1       0         1
                      5         0     1       0         1
                      6         0     1       0         1
                      7         0     1       0         1
                      8         0     1       0         1
                      9         0     1       0         1
                      10        0     1       0         1
                      11        0     1       1         0
    4239041  Windows  4         1     1       0         1
                      5         0     1       0         1
                      6         0     1       0         1
                      7         0     1       1         0

With a final result similar to (from the grouping above):

                   Opened  Closed  CarryFwd
    SvcGroup Hour
    Unix     3          6       7        47
             4          7      10        44
             5          1       6        39
             6         11       2        48
             7          7       3        52
             8          5       5        52
             9          5      11        46
    Windows  3          6       7        22
             4          3      10        15
             5          5       2        18
             6          6       2        22
             7         11      11        22
             8          2       4        20
             9          0       2        18

Note: this is broken down by hour of day, but I could just as well look at it by afternoon, by week, etc. Once I get to the above, I can tell whether a service group is gaining ground, lagging behind, etc.

Any ideas on how to approach this? The part I really can't work out is how to take the CreatedAt-to-ClosedAt span and break it down into discrete time intervals (hours, etc.) ...

Any recommendations are appreciated. Thanks.

2 answers

This is only a partial answer.

Read in your data; note that parse_dates combines each pair of date/time columns into a single datetime column:

    In [75]: df = read_csv(StringIO(data), sep='\s+', skiprows=1,
                           parse_dates=[[3, 4], [5, 6]], header=None)

    In [76]: df.columns = ['created', 'closed', 'idx', 'num', 'typ']

    In [77]: df
    Out[77]:
                   created                     closed  idx      num      typ
    0  2013-07-28 03:55:00 2013-07-28 11:01:37.346438    0  4237941     Unix
    1  2013-07-28 04:59:00 2013-07-28 18:25:02.193182    1  4238041  Windows
    2  2013-07-28 05:09:00 2013-07-28 23:11:12.003673    2  4238051  Windows
    3  2013-07-28 05:10:00 2013-07-28 05:32:51.547251    3  4238291  Windows
    4  2013-07-28 01:15:00        2013-07-28 10:09:20    4  4238321     Unix
    5  2013-07-28 01:53:00 2013-07-28 17:42:56.192088    5  4238331     Unix
    6  2013-07-28 02:03:00 2013-07-28 06:34:09.455042    6  4238561  Windows
    7  2013-07-28 02:03:00 2013-07-28 20:54:47.306731    7  4238691  Windows
    8  2013-07-28 03:23:00 2013-07-28 13:15:20.823505    8  4238811  Windows
    9  2013-07-28 04:16:00 2013-07-28 23:51:55.561463    9  4238851  Windows
    10 2013-07-28 04:26:00 2013-07-28 09:27:06.275342   10  4239011     Unix
    11 2013-07-28 04:38:00 2013-07-28 07:55:34.416621   11  4239041  Windows
    12 2013-07-28 08:15:00 2013-07-28 08:46:42.380739   12  4239131     Unix
    13 2013-07-28 01:08:00 2013-07-28 15:37:12.266341   13  4239141  Windows

    In [78]: df.dtypes
    Out[78]:
    created    datetime64[ns]
    closed     datetime64[ns]
    idx                 int64
    num                 int64
    typ                object
    dtype: object

For each event, put a 1 in every hour of its created-to-closed range, then fill NaN with 0:

    In [82]: m = df.apply(lambda x: Series(1, index=np.arange(x['created'].hour,
                                                              x['closed'].hour + 1)),
                          axis=1).fillna(0)

    In [81]: m
    Out[81]:
        1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23
    0   0   0   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0
    1   0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0
    2   0   0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
    3   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    4   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0
    5   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0
    6   0   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    7   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0
    8   0   0   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
    9   0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
    10  0   0   0   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    11  0   0   0   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    12  0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    13  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0

Join it back to the original data and set the index:

    In [83]: y = df[['num', 'typ']].join(m).set_index(['num', 'typ'])

    In [84]: y
    Out[84]:
                      1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23
    num     typ
    4237941 Unix      0   0   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0
    4238041 Windows   0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0
    4238051 Windows   0   0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
    4238291 Windows   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    4238321 Unix      1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0
    4238331 Unix      1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0
    4238561 Windows   0   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    4238691 Windows   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0
    4238811 Windows   0   0   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
    4238851 Windows   0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
    4239011 Unix      0   0   0   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    4239041 Windows   0   0   0   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    4239131 Unix      0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    4239141 Windows   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0

At this point you can do the calculations.

Opened/Closed are straightforward edge detection: the first 1 in each row is the hour the ticket opened, the last 1 the hour it closed. CarryFwd is essentially m itself, i.e. m.where(m==1).
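To make that concrete, here is a minimal sketch of that last step, assuming the df and m built above and a recent pandas (Series.dt). Instead of literal edge detection on m, it reads the opening and closing edges straight off the hour of created and closed; that substitution and the variable names are mine, not the answerer's.

    import pandas as pd

    # Opened: a 1 in the hour each ticket was created; Closed: a 1 in the
    # hour it was closed; CarryFwd: open during the hour but not closed in it.
    opened = pd.get_dummies(df['created'].dt.hour).reindex(columns=m.columns, fill_value=0).astype(int)
    closed = pd.get_dummies(df['closed'].dt.hour).reindex(columns=m.columns, fill_value=0).astype(int)
    carry = m.astype(int) - closed

    # Sum each 0/1 mask per service group, then stack hours into the index.
    summary = pd.concat(
        {name: part.join(df['typ']).groupby('typ').sum().stack()
         for name, part in {'Opened': opened, 'Closed': closed, 'CarryFwd': carry}.items()},
        axis=1)
    summary.index.names = ['SvcGroup', 'Hour']

Summing the per-ticket 0/1 masks per group and hour gives exactly the Opened/Closed/CarryFwd counts in the question's target table.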


Here is another way ...

Create a function that takes a row and builds the corresponding DataFrame, one output row per hour the ticket was open:

    import numpy as np
    import pandas as pd

    def sparse_opened_closed(row):
        # Rows keyed by (TicketNo, SvcGroup, Hour) for each open hour.
        opened_hour, closed_hour = row['CreatedAt'].hour, row['ClosedAt'].hour
        hours = np.arange(opened_hour, closed_hour + 1)
        index = pd.MultiIndex.from_tuples(
            [(row['TicketNo'], row['SvcGroup'], h) for h in hours])
        opened, closed = np.zeros_like(hours), np.zeros_like(hours)
        opened[0], closed[-1] = 1, 1   # edges: first hour opens it, last hour closes it
        is_open, carry = np.ones_like(hours), np.ones_like(hours)
        carry[-1] = 0                  # the closing hour is not carried forward
        return pd.DataFrame({'Opened': opened, 'Open': is_open,
                             'Closed': closed, 'CarryFwd': carry}, index=index)

You could make it more efficient.

Now iterate over the rows and concat:

    In [11]: pd.concat(sparse_opened_closed(row) for _, row in df.iterrows()).head(10)
    Out[11]:
                         CarryFwd  Closed  Open  Opened
    4237941 Unix     3          1       0     1       1
                     4          1       0     1       0
                     5          1       0     1       0
                     6          1       0     1       0
                     7          1       0     1       0
                     8          1       0     1       0
                     9          1       0     1       0
                     10         1       0     1       0
                     11         0       1     1       0
    4238041 Windows  4          1       0     1       1
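To get from there to the final table in the question, one possible last step (my own sketch, not part of the answer) is to name the index levels and sum away the ticket level:

    # Collapse per-ticket rows into per-(SvcGroup, Hour) totals.
    sparse = pd.concat(sparse_opened_closed(row) for _, row in df.iterrows())
    sparse.index.names = ['TicketNo', 'SvcGroup', 'Hour']
    totals = sparse.groupby(level=['SvcGroup', 'Hour']).sum()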
