Aggregate and populate data for missing dates (days and hours)

Question

Aggregate and populate data for missing dates (days and hours)

Suppose we have a list showing the amount of each object on a specific date (mm-dd-yyyy-hour-minute):

A = [ [ ['07-07-2012-21-04', 'orange', 1], ['08-16-2012-08-57', 'orange', 1], ['08-18-2012-03-30', 'orange', 1], ['08-18-2012-03-30', 'orange', 1], ['08-19-2012-03-58', 'orange', 1], ['08-19-2012-03-58', 'orange', 1], ['08-19-2012-04-09', 'orange', 1], ['08-19-2012-04-09', 'orange', 1], ['08-19-2012-05-21', 'orange', 1], ['08-19-2012-05-21', 'orange', 1], ['08-19-2012-06-03', 'orange', 1], ['08-19-2012-07-51', 'orange', 1], ['08-19-2012-08-17', 'orange', 1], ['08-19-2012-08-17', 'orange', 1] ], [ ['07-07-2012-21-04', 'banana', 1] ], [ ['07-07-2012-21-04', 'mango', 1], ['08-16-2012-08-57', 'mango', 1], ['08-18-2012-03-30', 'mango', 1], ['08-18-2012-03-30', 'mango', 1], ['08-19-2012-03-58', 'mango', 1], ['08-19-2012-03-58', 'mango', 1], ['08-19-2012-04-09', 'mango', 1], ['08-19-2012-04-09', 'mango', 1], ['08-19-2012-05-21', 'mango', 1], ['08-19-2012-05-21', 'mango', 1], ['08-19-2012-06-03', 'mango', 1], ['08-19-2012-07-51', 'mango', 1], ['08-19-2012-08-17', 'mango', 1], ['08-19-2012-08-17', 'mango', 1] ]

]

I need to do in A to fill in all the missing dates (from the minimum date to the maximum date A) for each object with a value as 0. When the missing dates and their corresponding values (0) are turned on, I want to sum the values for each date so that no the date did not repeat - for each sublist.

Now, I'm trying to do the following: I split the dates and values separately (in lists with the names u and v) and convert each sublist to a series of pandas and allocate the corresponding indexes to them. Therefore, for each zip (u, v):

 def generate(values, indices): indices = flatten(indices) date_index = DatetimeIndex(indices) ts = Series(values, index=date_index) ts.reindex(date_range(min(date_index), max(date_index))) return ts

But here redefinition raises an exception. What I'm looking for is a purely pythonic way (without pandas) that is completely based on list comprehension, or perhaps even in numpy arrays.

There is another problem of aggregation for several hours, which means that if all the dates are the same, and only the hours are different, I want to fill in all the missing hours of the day, and then repeat the same aggregation process for each hour, with missing hours, filled with 0 values.

Thanks in advance.

+4

python numpy pandas

khan Aug 28 '13 at 2:48

source share

1 answer

paluh · Accepted Answer · 2013-09-01T13:24:24+0000

How about this:

 from collections import defaultdict, OrderedDict from datetime import datetime, timedelta from itertools import chain, groupby flat = sorted((datetime.strptime(d, '%m-%d-%Y-%H-%M').date(), f, c) for (d, f, c) in chain(*A)) counts = [(d, f, sum(e[2] for e in l)) for (d, f), l in groupby(flat, key=lambda t: (t[0], t[1]))] # lets assume that there are some data start = counts[0][0] end = counts[-1][0] result = OrderedDict((start+timedelta(days=i), defaultdict(int)) for i in range((end-start).days+1)) for day, data in groupby(counts, key=lambda d: d[0]): result[day].update((f, c) for d, f, c in data)

My question is: can we really fill in non-existing dates - I can easily imagine a situation where it will be a lot of data, even a dangerous amount of data ... I think it is better to use simple common functions and a generator if you want them somewhere transfer:

 from collections import defaultdict from datetime import datetime, timedelta from itertools import chain, groupby def aggregate(data, resolution='daily'): assert resolution in ['hourly', 'daily'] if resolution == 'hourly': round_dt = lambda dt: dt.replace(minute=0, second=0, microsecond=0) else: round_dt = lambda dt: dt.date() flat = sorted((round_dt(datetime.strptime(d, '%m-%d-%Y-%H-%M')), f, c) for (d, f, c) in chain(*A)) counts = [(d, f, sum(e[2] for e in l)) for (d, f), l in groupby(flat, key=lambda t: (t[0], t[1]))] result = {} for day, data in groupby(counts, key=lambda d: d[0]): d = result[day] = defaultdict(int) d.update((f, c) for d, f, c in data) return result def xaggregate(data, resolution='daily'): aggregated = aggregate(data, resolution) curr = min(aggregated.keys()) end = max(aggregated.keys()) interval = timedelta(days=1) if resolution == 'daily' else timedelta(seconds=3600) while curr <= end: # None is sensible value in case of missing data I think yield curr, aggregated.get(curr) curr += interval

In general, my suggestion is that you should not use lists as ordered structures (I mean ['07-07-2012-21-04', 'mango', 1] ). I think tuple more suitable for this purpose, and of course collections.namedtuple even more desirable.

Aggregate and populate data for missing dates (days and hours)

More articles: