Grouping with Python Pandas

I have a data set consisting of several tuples per timestamp, each with an associated count. The set of tuples may differ from timestamp to timestamp. I would like to group them into 5-minute bins and sum the counts for each unique tuple. Is there a good clean way to do this using Pandas groupby?

They have the form: ((u'67.163.47.231 ', u'8.27.82.254', 50186, 80, 6, 1377565195000), 2)

This is currently a list: each element is a 6-tuple (whose last entry is a millisecond timestamp) paired with a count.

For each timestamp, several 5-tuples are collected (together with the timestamp), each with a count. For example, all for one timestamp:

 [((u'71.57.43.240', u'8.27.82.254', 33108, 80, 6, 1377565195000), 1),
  ((u'67.163.47.231', u'8.27.82.254', 50186, 80, 6, 1377565195000), 2),
  ((u'8.27.82.254', u'98.206.29.242', 25159, 80, 6, 1377565195000), 1),
  ((u'71.179.102.253', u'8.27.82.254', 50958, 80, 6, 1377565195000), 1)]

As a DataFrame:

 In [220]: df = DataFrame({'key1': [(u'71.57.43.240', u'8.27.82.254', 33108, 80, 6),
    .....:                          (u'67.163.47.231', u'8.27.82.254', 50186, 80, 6)],
    .....:                 'data1': np.array((1, 2)),
    .....:                 'data2': np.array((1377565195000, 1377565195000))})

 In [226]: df
 Out[226]:
    data1          data2                                        key1
 0      1  1377565195000   (71.57.43.240, 8.27.82.254, 33108, 80, 6)
 1      2  1377565195000  (67.163.47.231, 8.27.82.254, 50186, 80, 6)

or with the timestamp converted:

 In [231]: df = DataFrame({'key1': [(u'71.57.43.240', u'8.27.82.254', 33108, 80, 6),
    .....:                          (u'67.163.47.231', u'8.27.82.254', 50186, 80, 6)],
    .....:                 'data1': np.array((1, 2)),
    .....:                 'data2': np.array((datetime.utcfromtimestamp(1377565195),
    .....:                                    datetime.utcfromtimestamp(1377565195)))})

 In [232]: df
 Out[232]:
    data1               data2                                        key1
 0      1 2013-08-27 00:59:55   (71.57.43.240, 8.27.82.254, 33108, 80, 6)
 1      2 2013-08-27 00:59:55  (67.163.47.231, 8.27.82.254, 50186, 80, 6)

Here is a simpler example:

     time      count  city
     00:00:00      1  Montreal
     00:00:00      2  New York
     00:00:00      1  Chicago
     00:01:00      2  Montreal
     00:01:00      3  New York

after binning:

     time      count  city
     00:05:00      3  Montreal
     00:05:00      5  New York
     00:05:00      1  Chicago

The following seems to work well:

 times = [parse('00:00:00'), parse('00:00:00'), parse('00:00:00'),
          parse('00:01:00'), parse('00:01:00'), parse('00:02:00'),
          parse('00:02:00'), parse('00:03:00'), parse('00:04:00'),
          parse('00:05:00'), parse('00:05:00'), parse('00:06:00'),
          parse('00:06:00')]
 cities = ['Montreal', 'New York', 'Chicago', 'Montreal', 'New York',
           'New York', 'Chicago', 'Montreal', 'Montreal', 'New York',
           'Chicago', 'Montreal', 'Chicago']
 counts = [1, 2, 1, 2, 3, 1, 1, 1, 2, 2, 2, 1, 1]
 frame = DataFrame({'city': cities, 'time': times, 'count': counts})

 In [150]: frame
 Out[150]:
         city  count                time
 0   Montreal      1 2013-09-07 00:00:00
 1   New York      2 2013-09-07 00:00:00
 2    Chicago      1 2013-09-07 00:00:00
 3   Montreal      2 2013-09-07 00:01:00
 4   New York      3 2013-09-07 00:01:00
 5   New York      1 2013-09-07 00:02:00
 6    Chicago      1 2013-09-07 00:02:00
 7   Montreal      1 2013-09-07 00:03:00
 8   Montreal      2 2013-09-07 00:04:00
 9   New York      2 2013-09-07 00:05:00
 10   Chicago      2 2013-09-07 00:05:00
 11  Montreal      1 2013-09-07 00:06:00
 12   Chicago      1 2013-09-07 00:06:00

 frame['time_5min'] = frame['time'].map(
     lambda x: pd.DataFrame([0], index=pd.DatetimeIndex([x])).resample('5min').index[0])

 In [152]: frame
 Out[152]:
         city  count                time           time_5min
 0   Montreal      1 2013-09-07 00:00:00 2013-09-07 00:00:00
 1   New York      2 2013-09-07 00:00:00 2013-09-07 00:00:00
 2    Chicago      1 2013-09-07 00:00:00 2013-09-07 00:00:00
 3   Montreal      2 2013-09-07 00:01:00 2013-09-07 00:00:00
 4   New York      3 2013-09-07 00:01:00 2013-09-07 00:00:00
 5   New York      1 2013-09-07 00:02:00 2013-09-07 00:00:00
 6    Chicago      1 2013-09-07 00:02:00 2013-09-07 00:00:00
 7   Montreal      1 2013-09-07 00:03:00 2013-09-07 00:00:00
 8   Montreal      2 2013-09-07 00:04:00 2013-09-07 00:00:00
 9   New York      2 2013-09-07 00:05:00 2013-09-07 00:05:00
 10   Chicago      2 2013-09-07 00:05:00 2013-09-07 00:05:00
 11  Montreal      1 2013-09-07 00:06:00 2013-09-07 00:05:00
 12   Chicago      1 2013-09-07 00:06:00 2013-09-07 00:05:00

 In [153]: df = frame.groupby(['time_5min', 'city']).aggregate('sum')

 In [154]: df
 Out[154]:
                               count
 time_5min           city
 2013-09-07 00:00:00 Chicago       2
                     Montreal      6
                     New York      6
 2013-09-07 00:05:00 Chicago       3
                     Montreal      1
                     New York      2

 In [155]: df.reset_index(1)
 Out[155]:
                          city  count
 time_5min
 2013-09-07 00:00:00   Chicago      2
 2013-09-07 00:00:00  Montreal      6
 2013-09-07 00:00:00  New York      6
 2013-09-07 00:05:00   Chicago      3
 2013-09-07 00:05:00  Montreal      1
 2013-09-07 00:05:00  New York      2
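A note on the per-row resample trick above: on current pandas versions the same binning can be done in one vectorized step with `Series.dt.floor`, avoiding the lambda entirely. A sketch reproducing the city example (assuming a reasonably recent pandas, and using `pd.to_datetime` in place of `dateutil`'s `parse`):

```python
import pandas as pd

times = pd.to_datetime(
    ["00:00:00", "00:00:00", "00:00:00", "00:01:00", "00:01:00",
     "00:02:00", "00:02:00", "00:03:00", "00:04:00", "00:05:00",
     "00:05:00", "00:06:00", "00:06:00"])
cities = ["Montreal", "New York", "Chicago", "Montreal", "New York",
          "New York", "Chicago", "Montreal", "Montreal", "New York",
          "Chicago", "Montreal", "Chicago"]
counts = [1, 2, 1, 2, 3, 1, 1, 1, 2, 2, 2, 1, 1]
frame = pd.DataFrame({"city": cities, "time": times, "count": counts})

# Floor every timestamp to its 5-minute bin in one shot, then sum per bin/city
frame["time_5min"] = frame["time"].dt.floor("5min")
result = frame.groupby(["time_5min", "city"], as_index=False)["count"].sum()
```

This produces the same six rows as the `Out[154]` table above (e.g. Montreal gets 6 in the first bin).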
+6
2 answers

If you set the date as the index, you can use TimeGrouper (which lets you group by, for example, 5-minute intervals):

 In [11]: from pandas.tseries.resample import TimeGrouper

 In [12]: df.set_index('data2', inplace=True)

 In [13]: g = df.groupby(TimeGrouper('5Min'))
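That import path is from 2013-era pandas; `pandas.tseries.resample.TimeGrouper` has since been removed, and `pd.Grouper(freq=...)` plays the same role. A sketch of the equivalent on current pandas, using the toy data from the question:

```python
import pandas as pd

# Toy data from the question; pd.Grouper replaces the removed TimeGrouper
df = pd.DataFrame({
    "key1": [("71.57.43.240", "8.27.82.254", 33108, 80, 6),
             ("67.163.47.231", "8.27.82.254", 50186, 80, 6)],
    "data1": [1, 2],
    "data2": pd.to_datetime([1377565195000] * 2, unit="ms"),
}).set_index("data2")

g = df.groupby(pd.Grouper(freq="5Min"))
uniq = g["key1"].nunique()  # unique tuples per 5-minute bin
```

The result is the same Series as `Out[14]` below: one bin labelled 2013-08-27 00:55:00 with 2 unique tuples.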

You can then count the number of unique items in each 5-minute interval using nunique:

 In [14]: g['key1'].nunique()
 Out[14]:
 2013-08-27 00:55:00    2
 dtype: int64

If you want the count of each individual tuple, you can use value_counts:

 In [15]: g['key1'].apply(pd.value_counts)
 Out[15]:
 2013-08-27 00:55:00  (71.57.43.240, 8.27.82.254, 33108, 80, 6)     1
                      (67.163.47.231, 8.27.82.254, 50186, 80, 6)    1
 dtype: int64

Note: this is a Series with a MultiIndex (use reset_index to turn it into a DataFrame):

 In [16]: g['key1'].apply(pd.value_counts).reset_index(1)
 Out[16]:
                                                         level_1  0
 2013-08-27 00:55:00   (71.57.43.240, 8.27.82.254, 33108, 80, 6)  1
 2013-08-27 00:55:00  (67.163.47.231, 8.27.82.254, 50186, 80, 6)  1

You probably want to give these more informative column names :).
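One way to do that naming is to label the MultiIndex levels before flattening. A sketch on current pandas (the names "bin" and "flow" are made up for illustration; the top-level `pd.value_counts` function is deprecated, so a lambda over `Series.value_counts` is used instead):

```python
import pandas as pd

df = pd.DataFrame({
    "key1": [("71.57.43.240", "8.27.82.254", 33108, 80, 6),
             ("67.163.47.231", "8.27.82.254", 50186, 80, 6)],
    "data1": [1, 2],
    "data2": pd.to_datetime([1377565195000] * 2, unit="ms"),
}).set_index("data2")

g = df.groupby(pd.Grouper(freq="5Min"))
# Count occurrences of each tuple per bin...
counts = g["key1"].apply(lambda s: s.value_counts())
# ...then name the index levels and flatten to a DataFrame with clear columns
out = counts.rename_axis(["bin", "flow"]).reset_index(name="count")
```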

Update: I previously hacked this together with get_dummies; see the edit history.

+4

If you just want to sum the counts for each unique tuple, simply group by key1:

 df.groupby('key1').aggregate('sum') 

If you want to do this for each timestamp and each unique tuple, you can pass several columns to group on:

 df.groupby(['data2', 'key1']).aggregate('sum') 

If you need to combine different timestamps into one 5-minute bin, you first have to round your timestamps to 5 minutes, and then group:

 df['data2_5min'] = (np.ceil(df['data2'].values.astype('int64') /
                             (5.0*60*1000000000)) *
                     (5.0*60*1000000000)).astype('int64').astype('M8[ns]')
 df.groupby(['data2_5min', 'key1']).aggregate('sum')
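The integer arithmetic above predates pandas' datetime accessors; on current versions the same rounding can be written with `Series.dt.ceil` (which, like the `np.ceil` version, labels each bin by its right edge; use `dt.floor` for left-edge labels). A sketch with the toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "key1": [("71.57.43.240", "8.27.82.254", 33108, 80, 6),
             ("67.163.47.231", "8.27.82.254", 50186, 80, 6)],
    "data1": [1, 2],
    "data2": pd.to_datetime([1377565195000] * 2, unit="ms"),
})

# dt.ceil reproduces the np.ceil arithmetic: right bin edge as label
df["data2_5min"] = df["data2"].dt.ceil("5min")
res = df.groupby(["data2_5min", "key1"])["data1"].sum()
```

Here 2013-08-27 00:59:55 rounds up to the 01:00:00 bin edge.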

If you want to keep some of the original timestamps (although you then have to choose which one when several fall into the same bin), you can specify which function to apply to each individual column. For example, to take the first:

 df2 = df.groupby(['data2_5min', 'key1']).aggregate({'data1': 'sum', 'data2': 'first'})
 df2.reset_index(0, drop=True).set_index('data2', append=True)

If you just want to resample to 5 minutes and sum the counts regardless of the keys, you can simply do:

 df.set_index('data2', inplace=True)
 df.resample('5min', 'sum')
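The two-argument `resample('5min', 'sum')` form is from old pandas; current versions chain the aggregation after `resample` instead. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "data1": [1, 2],
    "data2": pd.to_datetime([1377565195000] * 2, unit="ms"),
}).set_index("data2")

# Modern equivalent of df.resample('5min', 'sum')
total = df.resample("5min")["data1"].sum()
```

With the toy data this yields a single 00:55:00 bin holding the summed count 3.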
+1

Source: https://habr.com/ru/post/953269/
