How to calculate time differences within groups with pandas?

Problem

I want to calculate the time difference between consecutive rows within each group, but I do not know how to sort the time column so that the differences in each group are in order and positive.

Initial data:

    In [37]: df
    Out[37]:
      id                time
    0  A 2016-11-25 16:32:17
    1  A 2016-11-25 16:36:04
    2  A 2016-11-25 16:35:29
    3  B 2016-11-25 16:35:24
    4  B 2016-11-25 16:35:46
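For anyone who wants to reproduce this, the sample frame can be rebuilt roughly as follows. This is a minimal sketch: it assumes the time column is parsed into a datetime dtype with pd.to_datetime, which the question does not state explicitly.

```python
import pandas as pd

# Rebuild the sample frame from the question.
# Assumption: "time" holds real datetimes, not strings.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "time": pd.to_datetime([
        "2016-11-25 16:32:17",
        "2016-11-25 16:36:04",
        "2016-11-25 16:35:29",
        "2016-11-25 16:35:24",
        "2016-11-25 16:35:46",
    ]),
})

print(df)
print(df.dtypes)
```

Note that the rows of group A are deliberately out of time order, which is what makes a plain diff go negative.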

The result I want:

    Out[40]:
      id   time
    0  A  00:35
    1  A  03:12
    2  B  00:22

Note: the time column here has dtype timedelta64[ns].

Attempt

    In [38]: df['time'].diff(1)
    Out[38]:
    0                 NaT
    1            00:03:47
    2   -1 days +23:59:25
    3   -1 days +23:59:55
    4            00:00:22
    Name: time, dtype: timedelta64[ns]

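This does not give the desired result: the differences are taken over the whole column in its original order, so they cross group boundaries and some of them are negative.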

Hope

Ideally the solution should not only solve the problem but also run quickly, because the data has 50 million rows.

1 answer

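You can use sort_values together with groupby and diff. The result is aligned back to the original rows by index when it is assigned, so df itself keeps its row order: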

    # Sort by id and time, then take consecutive differences within each id
    df['diff'] = df.sort_values(['id','time']).groupby('id')['time'].diff()
    print (df)
      id                time     diff
    0  A 2016-11-25 16:32:17      NaT
    1  A 2016-11-25 16:36:04 00:00:35
    2  A 2016-11-25 16:35:29 00:03:12
    3  B 2016-11-25 16:35:24      NaT
    4  B 2016-11-25 16:35:46 00:00:22
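To check that this really yields only non-negative differences, here is a self-contained sketch on the sample data. The variable name diffs and the way the frame is built are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data from the question (assumed to be parsed as datetimes)
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "time": pd.to_datetime([
        "2016-11-25 16:32:17",
        "2016-11-25 16:36:04",
        "2016-11-25 16:35:29",
        "2016-11-25 16:35:24",
        "2016-11-25 16:35:46",
    ]),
})

# Differences are computed on the sorted rows, one group at a time
diffs = df.sort_values(["id", "time"]).groupby("id")["time"].diff()

# Assignment aligns on the index labels, so df keeps its original order
df["diff"] = diffs

# Every non-missing difference should now be non-negative
assert (df["diff"].dropna() >= pd.Timedelta(0)).all()
print(df)
```

The first row of each group in time order gets NaT, because it has no predecessor within its group.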

If you need to remove rows with NaT in the diff column, use dropna:

    # Keep only rows that have a previous row in the same group
    df = df.dropna(subset=['diff'])
    print (df)
      id                time     diff
    1  A 2016-11-25 16:36:04 00:00:35
    2  A 2016-11-25 16:35:29 00:03:12
    4  B 2016-11-25 16:35:46 00:00:22

You can also overwrite the column:

    # Replace the timestamps with the per-group differences
    df['time'] = df.sort_values(['id','time']).groupby('id')['time'].diff()
    print (df)
      id     time
    0  A      NaT
    1  A 00:00:35
    2  A 00:03:12
    3  B      NaT
    4  B 00:00:22

To overwrite the column and drop the NaT rows in one go:

    df['time'] = df.sort_values(['id','time']).groupby('id')['time'].diff()
    df = df.dropna(subset=['time'])
    print (df)
      id     time
    1  A 00:00:35
    2  A 00:03:12
    4  B 00:00:22
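Since the question mentions 50 million rows, one variation that may be worth timing is to sort once and skip groupby entirely: take a plain diff over the sorted column and blank out the first row of each id. This is a sketch of an alternative technique, not the method from the answer above, and whether it is actually faster should be measured on the real data. The names s, d and first_in_group are illustrative.

```python
import pandas as pd

# Sample data from the question (assumed to be parsed as datetimes)
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "time": pd.to_datetime([
        "2016-11-25 16:32:17",
        "2016-11-25 16:36:04",
        "2016-11-25 16:35:29",
        "2016-11-25 16:35:24",
        "2016-11-25 16:35:46",
    ]),
})

# Sort once so that each id forms one contiguous, time-ordered run
s = df.sort_values(["id", "time"])

# Plain diff over the whole sorted column (no groupby involved)
d = s["time"].diff()

# A row starts a new group when its id differs from the previous row's id
first_in_group = s["id"].ne(s["id"].shift())

# Blank out differences that would otherwise cross a group boundary
d = d.mask(first_in_group)

# Align back to the original rows by index, then drop the missing ones
df["diff"] = d
result = df.dropna(subset=["diff"])
print(result)
```

On the sample data this produces the same differences as the groupby version. It relies on the rows of each id being contiguous after the sort.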

Source: https://habr.com/ru/post/1260346/
