The pythonic way of filtering data with overlapping dates

Question

The pythonic way of filtering data with overlapping dates

I have this data structure in which each team has a list of problems with start and end dates.

For each team, I would like to combine problems with the same key and overlapping dates, when as a result of the problem the start date will be less, the date and end date will be larger.

I am trying to do this with a few for loops, but I was wondering what would be the best Pythonic way to do this.

Update

I want to combine only problems with the same key in one command and with overlapping dates.
The problems are not in chronological order.

Input:

 { 'Team A': [{ 'start': '11/Jul/13 1:49 PM', 'end': '10/Oct/13 5:16 PM', 'issue': 'KEY-12678' }, { 'start': '3/Oct/13 10:40 AM', 'end': '11/Nov/13 1:02 PM', 'issue': 'KEY-12678' }], 'Team B': [{ 'start': '5/Sep/13 3:35 PM', 'end': '08/Nov/13 3:35 PM', 'issue': 'KEY-12679' }, { 'start': '19/Aug/13 5:05 PM', 'end': '10/Sep/13 5:16 PM', 'issue': 'KEY-12679' }, { 'start': '09/Jul/13 9:15 AM', 'end': '29/Jul/13 9:15 AM', 'issue': 'KEY-12680' }] }

Output:

 { 'Team A': [{ 'start': '11/Jul/13 1:49 PM', 'end': '11/Nov/13 1:02 PM', 'issue': 'KEY-12678' }], 'Team B': [{ 'start': '19/Aug/13 5:05 PM', 'end': '08/Nov/13 3:35 PM', 'issue': 'KEY-12679' }, { 'start': '09/Jul/13 9:15 AM', 'end': '29/Jul/13 9:15 AM', 'issue': 'KEY-12680' }] }

To analyze the date, here is the date format (to save a couple of minutes):

 date_format = "%d/%b/%y %H:%M %p"

Update, new test data

Enter

 d = { "N/A": [ {'start': '23/Jun/14 8:48 PM', 'end': '01/Aug/14 11:00 PM', 'issue': 'KEY-12157'} ,{'start': '09/Jul/13 1:57 PM', 'end': '29/Jul/13 1:57 PM', 'issue': 'KEY-12173'} ,{'start': '21/Aug/13 12:29 PM', 'end': '02/Dec/13 6:06 PM', 'issue': 'KEY-12173'} ,{'start': '17/Feb/14 3:17 PM', 'end': '18/Feb/14 5:51 PM', 'issue': 'KEY-12173'} ,{'start': '12/May/14 4:42 PM', 'end': '02/Jun/14 4:42 PM', 'issue': 'KEY-12173'} ,{'start': '24/Jun/14 11:33 AM', 'end': '01/Aug/14 11:49 AM', 'issue': 'KEY-12173'} ,{'start': '07/Oct/14 1:17 PM', 'end': '17/Nov/14 10:30 AM', 'issue': 'KEY-12173'} ,{'start': '31/Mar/15 1:58 PM', 'end': '12/May/15 4:26 PM', 'issue': 'KEY-12173'} ,{'start': '15/Jul/14 10:06 AM', 'end': '15/Sep/14 5:25 PM', 'issue': 'KEY-12173'} ,{'start': '06/Jan/15 10:46 AM', 'end': '26/Jan/15 10:46 AM', 'issue': 'KEY-20628'} ,{'start': '18/Nov/14 5:08 PM', 'end': '16/Feb/15 1:31 PM', 'issue': 'KEY-20628'} ,{'start': '02/Oct/13 12:32 PM', 'end': '21/Oct/13 5:32 PM', 'issue': 'KEY-12146'} ,{'start': '11/Mar/14 12:08 PM', 'end': '31/Mar/14 12:08 PM', 'issue': 'KEY-12681'} ]}

Output

 {'start': '18/Nov/14 05:08 AM', 'issue': 'KEY-20628', 'end': '16/Feb/15 01:31 AM'} {'start': '09/Jul/13 1:57 PM', 'issue': 'KEY-12173', 'end': '29/Jul/13 1:57 PM'} {'start': '21/Aug/13 12:29 PM', 'issue': 'KEY-12173', 'end': '02/Dec/13 6:06 PM'} {'start': '17/Feb/14 3:17 PM', 'issue': 'KEY-12173', 'end': '18/Feb/14 5:51 PM'} {'start': '12/May/14 4:42 PM', 'issue': 'KEY-12173', 'end': '02/Jun/14 4:42 PM'} {'start': '24/Jun/14 11:33 AM', 'issue': 'KEY-12173', 'end': '15/Sep/14 05:25 AM'} {'start': '07/Oct/14 1:17 PM', 'issue': 'KEY-12173', 'end': '17/Nov/14 10:30 AM'} {'start': '31/Mar/15 1:58 PM', 'issue': 'KEY-12173', 'end': '12/May/15 4:26 PM'} {'start': '11/Mar/14 12:08 PM', 'issue': 'KEY-12681', 'end': '31/Mar/14 12:08 PM'} {'start': '23/Jun/14 8:48 PM', 'issue': 'KEY-12157', 'end': '01/Aug/14 11:00 PM'} {'start': '02/Oct/13 12:32 PM', 'issue': 'KEY-12146', 'end': '21/Oct/13 5:32 PM'}

+6

python date dictionary readability

ThePavolC May 05, '15 at 15:28

source share

3 answers

Kasramvd · Answer 1 · 2015-05-05T16:11:07+0000

You can convert your string dates to real python datetime objects using the format '%d/%b/%y %H:%M %p' and datetime.strftime and use itertools.groupby to group sub dictionaries based on the issue key, and you can loop into ziped groups and extract that max and min with max and min functions with the corresponding key function:

 from datetime import datetime from itertools import groupby from operator import itemgetter new={} for key in d: for dic in [ zip(*[i.items() for i in g]) for _,g in groupby(d[key],itemgetter('issue'))] : temp={} for p,t in [zip(*tup) for tup in dic]: val=p[0] if val=='start': temp[val]=min(t,key=lambda x:datetime.strptime(x,'%d/%b/%y %H:%M %p')) elif val=='end': temp[val]=max(t,key=lambda x:datetime.strptime(x,'%d/%b/%y %H:%M %p')) else: temp[val]=t[0] new.setdefault(key,[]).append(temp) print new

Result:

 {'Team A': [{'start': '11/Jul/13 1:49 PM', 'end': '11/Nov/13 1:02 PM', 'issue': 'KEY-12678'}], 'Team B': [{'start': '19/Aug/13 5:05 PM', 'end': '08/Nov/13 3:35 PM', 'issue': 'KEY-12679'}, {'start': '09/Jul/13 9:15 AM', 'end': '29/Jul/13 9:15 AM', 'issue': 'KEY-12680'}]}

ThePavolC · Answer 2 · 2015-05-06T09:36:10+0000

Here is my current code that seems to work (it's a bit hard to verify).

In my code, I use names like epic and mr , where each line in the epic data samples, but the release key refers to mr .

 from datetime import datetime date_format = "%d/%b/%y %H:%M %p" d = {"team" : [... sample data ...]} def get_list_of_mrs(epics): mrs = set() for epic in epics: mrs.add(epic['issue']) return mrs def is_overlap(epic1, epic2): start1 = datetime.strptime(epic1['start'], date_format) end1 = datetime.strptime(epic1['end'], date_format) start2 = datetime.strptime(epic2['start'], date_format) end2 = datetime.strptime(epic2['end'], date_format) return ((start1 <= end2) and (end1 >= start2)) def get_overlapping_dates(epic1, epic2): start1 = datetime.strptime(epic1['start'], date_format) end1 = datetime.strptime(epic1['end'], date_format) start2 = datetime.strptime(epic2['start'], date_format) end2 = datetime.strptime(epic2['end'], date_format) return (min(start1, start2), max(end1, end2)) def remove_overlaps(epics): filtered_epics = [] for epic in epics: for temp_epic in epics: if temp_epic == epic: continue if epic.has_key('overlap'): continue if is_overlap(epic, temp_epic): temp_epic['overlap'] = True new_start, new_end = get_overlapping_dates(epic, temp_epic) epic['start'] = new_start.strftime(date_format) epic['end'] = new_end.strftime(date_format) filtered_epics.append(epic) filtered_epics = filter(lambda x: not x.has_key('overlap'), filtered_epics) return filtered_epics for team in d: epics = d[team] epics.sort(key=lambda x: datetime.strptime(x['start'], date_format)) uniq_mrs_in_team = get_list_of_mrs(epics) filtered_mrs = [] for mr in uniq_mrs_in_team: mr_epics = filter(lambda x: x['issue'] == mr, epics) filtered = remove_overlaps(mr_epics) #print team, mr, len(mr_epics), len(filtered) for x_mr in mr_epics: #print " -",x_mr pass for x_mr in filtered: #print " +",x_mr pass filtered_mrs.extend(filtered) d[team] = filtered_mrs

Condla · Answer 3 · 2015-06-13T22:06:47+0000

I present the pandas solution mentioned in the aquavitae comments which contains the following steps:

Read the data from the d dictionary that you provided in the DataFrame.
Convert the start and end column to datetime objects.
Sort data by key and by start date and reset index
Loop over the data frame (inefficient, but so far I could not think of something better) and compare the end time of the current line with the start time of the next line, as well as if the keys are equal.
Request pandas data frame to get overlapping rows
Flip the rows to be unloaded and merge the data into the corresponding overlapping row.
Drop these lines.
Convert back to dictionary format.

It looks like this:

 import pandas as pd import numpy as np df = pd.DataFrame(d['N/A']) df['end'] = pd.to_datetime(df['end']) df['start'] = pd.to_datetime(df['start']) df.sort(['issue', 'start'], inplace=True) df.index = range(len(df)) time_overlaps = df[:-1]['end'] > df[1:]['start'] same_issue = df[:-1]['issue'] == df[1:]['issue'] rows_to_drop = np.logical_and(time_overlaps, same_issue) rows_to_drop_indices = [i+1 for i, j in enumerate(rows_to_drop) if j] for i in rows_to_drop_indices: df.loc[i-1, 'end'] = df.loc[i, 'end'] df.drop(rows_to_drop_indices, inplace=True)

If you do not want to save the DataFrame object and perform further calculations in the format specified in your question, follow these steps:

 df.to_dict('records')

EDIT: Found an effective way to do this!

The pythonic way of filtering data with overlapping dates

Update

Update, new test data

More articles: