I need to run a function in a large groupby query that checks if the two subgroups have any overlapping dates. The following is an example of one tmp group:
   ID  num       start        stop  subGroup
0  21   10  2006-10-10  2008-10-03         1
1  21   46  2006-10-10  2100-01-01         2
2  21    5  1997-11-25  1998-09-29         1
3  21   42  1998-09-29  2100-01-01         2
4  21    3  1997-01-07  1997-11-25         1
5  21    6  2006-10-10  2008-10-03         1
6  21   47  1998-09-29  2006-10-10         2
7  21    4  1997-01-07  1998-09-29         1
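To make the example group above easy to reproduce, it can be rebuilt as a DataFrame. This is only a sketch: it assumes that `start` and `stop` are stored as datetime columns (the printed output does not show the dtypes) and that the integer index runs from 0 to 7.

```python
import pandas as pd

# Rebuild the example group shown above.
# Assumption: start/stop are datetime64 columns; the dtypes are not shown in the post.
tmp = pd.DataFrame({
    "ID": [21] * 8,
    "num": [10, 46, 5, 42, 3, 6, 47, 4],
    "start": pd.to_datetime([
        "2006-10-10", "2006-10-10", "1997-11-25", "1998-09-29",
        "1997-01-07", "2006-10-10", "1998-09-29", "1997-01-07",
    ]),
    "stop": pd.to_datetime([
        "2008-10-03", "2100-01-01", "1998-09-29", "2100-01-01",
        "1997-11-25", "2008-10-03", "2006-10-10", "1998-09-29",
    ]),
    "subGroup": [1, 2, 1, 2, 1, 1, 2, 1],
})

print(tmp)
```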
The function I wrote for this is as follows:
def hasOverlap(tmp):
    d2_starts = tmp[tmp['subGroup']==2]['start']
    d2_stops = tmp[tmp['subGroup']==2]['stop']
    return tmp[tmp['subGroup']==1].apply(lambda row_d1: (
        #Check for part nested D2 in D1
        ((d2_starts >= row_d1['start']) &
         (d2_starts < row_d1['stop']) ) |
        ((d2_stops >= row_d1['start']) &
         (d2_stops < row_d1['stop']) ) |
        #Check for fully nested D1 in D2
        ((d2_stops >= row_d1['stop']) &
         (d2_starts <= row_d1['start']) )
        ).any()
        ,axis = 1
    ).any()
The problem is that this code has a lot of redundancy, and when I run the query:
groups.agg(hasOverlap)
It takes an unreasonably long time to complete.
Are there any performance fixes (e.g. using built-in functions or set_index) that I could do to speed this up?
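As a point of comparison, one possible direction is to replace the row-wise `apply` with NumPy broadcasting, so that every subgroup-1 interval is compared against every subgroup-2 interval in a single array operation. The sketch below keeps the same three conditions as `hasOverlap`; the name `has_overlap_broadcast` and the small demo frame are hypothetical, and it has not been benchmarked against the original.

```python
import numpy as np
import pandas as pd

def has_overlap_broadcast(tmp):
    """Hypothetical rewrite of hasOverlap using NumPy broadcasting."""
    d1 = tmp[tmp["subGroup"] == 1]
    d2 = tmp[tmp["subGroup"] == 2]
    # Subgroup 1 as column vectors, subgroup 2 as row vectors,
    # so each comparison yields an (n1, n2) boolean grid.
    s1 = d1["start"].to_numpy()[:, None]
    e1 = d1["stop"].to_numpy()[:, None]
    s2 = d2["start"].to_numpy()[None, :]
    e2 = d2["stop"].to_numpy()[None, :]
    overlap = (
        ((s2 >= s1) & (s2 < e1))      # D2 start falls inside D1
        | ((e2 >= s1) & (e2 < e1))    # D2 stop falls inside D1
        | ((e2 >= e1) & (s2 <= s1))   # D1 fully nested in D2
    )
    return bool(overlap.any())

# Small demo frame (made-up rows, not taken from the real data).
demo = pd.DataFrame({
    "start": pd.to_datetime(["2000-01-01", "2002-01-01"]),
    "stop": pd.to_datetime(["2001-01-01", "2003-01-01"]),
    "subGroup": [1, 2],
})
print(has_overlap_broadcast(demo))
```

The function takes a whole group as a DataFrame, so it would presumably be called once per group, for example through `groupby(...).apply(...)`. The intermediate grid has one cell per (subgroup 1, subgroup 2) pair, which could become a memory concern for very large groups.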