Pandas aggregate count distinct

Let's say I have a user activity log and I want to generate a report of the total duration and the number of unique users per day.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'date': ['2013-04-01', '2013-04-01', '2013-04-01', '2013-04-02', '2013-04-02'],
                       'user_id': ['0001', '0001', '0002', '0002', '0002'],
                       'duration': [30, 15, 20, 15, 30]})

Aggregating the duration is quite simple:

    group = df.groupby('date')
    agg = group.aggregate({'duration': np.sum})
    agg
                duration
    date
    2013-04-01        65
    2013-04-02        45

What I would like to do is sum the duration and count the distinct users at the same time, but I cannot find an equivalent for count_distinct:

    agg = group.aggregate({'duration': np.sum,
                           'user_id': count_distinct})

This works, but surely there is a better way, no?

    group = df.groupby('date')
    agg = group.aggregate({'duration': np.sum})
    agg['uv'] = df.groupby('date').user_id.nunique()
    agg
                duration  uv
    date
    2013-04-01        65   2
    2013-04-02        45   1

I think I just need to provide a function that returns the count of distinct elements of a Series to the aggregate function, but I don't have a lot of exposure to the various libraries at my disposal. Also, it seems that the groupby object already knows this information, so wouldn't I just be duplicating the effort?
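For what it's worth, a minimal sketch of that idea, assuming a hand-rolled count_distinct helper (the name is mine, not a pandas builtin); aggregate accepts any callable that maps a group's Series to a scalar:

    # Hypothetical helper: aggregate() accepts any callable that takes a
    # Series (one group's user_id values) and returns a scalar.
    def count_distinct(s):
        return s.nunique()

    agg = df.groupby('date').aggregate({'duration': np.sum,
                                        'user_id': count_distinct})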

+71
python pandas
Sep 01 '13 at 3:25
3 answers

How about this:

    >>> df
             date  duration user_id
    0  2013-04-01        30    0001
    1  2013-04-01        15    0001
    2  2013-04-01        20    0002
    3  2013-04-02        15    0002
    4  2013-04-02        30    0002
    >>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique})
                duration  user_id
    date
    2013-04-01        65        2
    2013-04-02        45        1
    >>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
                duration  user_id
    date
    2013-04-01        65        2
    2013-04-02        45        1
+122
Sep 01 '13 at 3:31

'nunique' is now an option for .agg(), so:

    df.groupby('date').agg({'duration': 'sum', 'user_id': 'nunique'})
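If the output column names matter, the same aggregation can also be spelled with named aggregation, available since pandas 0.25; a small sketch, where the result names total_duration and uv are my own choice:

    # Named aggregation (pandas >= 0.25): each keyword defines an output
    # column as a (source column, aggregation) pair.
    df.groupby('date').agg(total_duration=('duration', 'sum'),
                           uv=('user_id', 'nunique'))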
+39
Jul 11 '17 at 21:27

Just adding to the answers already given: the @Blodwyn Pig solution (passing 'nunique' as a string) is the most efficient.

It seems much faster; tested here on a DataFrame of ~21M rows, which grouped down to ~2M:

    %time _=g.agg({"id": lambda x: x.nunique()})
    CPU times: user 3min 3s, sys: 2.94 s, total: 3min 6s
    Wall time: 3min 20s

    %time _=g.agg({"id": pd.Series.nunique})
    CPU times: user 3min 2s, sys: 2.44 s, total: 3min 4s
    Wall time: 3min 18s

    %time _=g.agg({"id": 'nunique'})
    CPU times: user 14 s, sys: 4.76 s, total: 18.8 s
    Wall time: 24.4 s
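For anyone wanting to reproduce the comparison, a rough sketch along these lines should do; the data here is synthetic and the sizes are scaled down from the ~21M-row test so it runs quickly (g and the id column above come from the poster's own data, so everything below is an assumption):

    import time

    import numpy as np
    import pandas as pd

    # Synthetic stand-in for the poster's data: 2M rows over 200k group keys,
    # scaled down from the ~21M-row test above.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({'key': rng.integers(0, 200_000, 2_000_000),
                       'id': rng.integers(0, 1_000, 2_000_000)})
    g = df.groupby('key')

    # Time the three aggregation spellings benchmarked above.
    for label, func in [('lambda', lambda x: x.nunique()),
                        ('pd.Series.nunique', pd.Series.nunique),
                        ("'nunique' string", 'nunique')]:
        start = time.perf_counter()
        _ = g.agg({'id': func})
        print(f"{label}: {time.perf_counter() - start:.2f}s")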
+11
Sep 27 '17 at 15:53


