Pandas multiple line date check

Question


I need to run a function in a large groupby query that checks if the two subgroups have any overlapping dates. The following is an example of one tmp group:

       ID  num       start        stop  subGroup
    0  21   10  2006-10-10  2008-10-03         1
    1  21   46  2006-10-10  2100-01-01         2
    2  21    5  1997-11-25  1998-09-29         1
    3  21   42  1998-09-29  2100-01-01         2
    4  21    3  1997-01-07  1997-11-25         1
    5  21    6  2006-10-10  2008-10-03         1
    6  21   47  1998-09-29  2006-10-10         2
    7  21    4  1997-01-07  1998-09-29         1
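For anyone who wants to reproduce this, the group above can be rebuilt roughly as follows. This is only a sketch: it assumes the `start` and `stop` columns are stored as datetimes rather than strings, which the printed output does not confirm.

```python
import pandas as pd

# Rebuild the example group shown above.
# Assumption: start/stop are datetime64 columns, not plain strings.
tmp = pd.DataFrame({
    "ID": [21] * 8,
    "num": [10, 46, 5, 42, 3, 6, 47, 4],
    "start": pd.to_datetime([
        "2006-10-10", "2006-10-10", "1997-11-25", "1998-09-29",
        "1997-01-07", "2006-10-10", "1998-09-29", "1997-01-07",
    ]),
    "stop": pd.to_datetime([
        "2008-10-03", "2100-01-01", "1998-09-29", "2100-01-01",
        "1997-11-25", "2008-10-03", "2006-10-10", "1998-09-29",
    ]),
    "subGroup": [1, 2, 1, 2, 1, 1, 2, 1],
})
```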

The function I wrote for this is as follows:

    def hasOverlap(tmp):
        d2_starts = tmp[tmp['subGroup'] == 2]['start']
        d2_stops = tmp[tmp['subGroup'] == 2]['stop']
        return tmp[tmp['subGroup'] == 1].apply(
            lambda row_d1: (
                # Check for part nested D2 in D1
                ((d2_starts >= row_d1['start']) & (d2_starts < row_d1['stop']))
                | ((d2_stops >= row_d1['start']) & (d2_stops < row_d1['stop']))
                # Check for fully nested D1 in D2
                | ((d2_stops >= row_d1['stop']) & (d2_starts <= row_d1['start']))
            ).any(),
            axis=1
        ).any()

The problem is that this code has a lot of redundancy and when I run the query:

 groups.agg(hasOverlap)

it takes an unreasonably long time to complete.

Are there any performance fixes (e.g. using built-in functions or set_index) that I could do to speed this up?


performance python pandas

Timy Apr 15 '16 at 18:14


1 answer

Sam · Answer 1 · 2016-04-15T21:42:21+0000

Do you just want to return "True" or "False" depending on the presence of overlap? If so, I would just get a list of dates for each subgroup and then use the pandas isin method to check if they overlap.

You can try something like this:

    # tmp is the DataFrame for a single group, as in the question
    # split subgroups into separate DataFrames
    group1 = tmp[tmp.subGroup == 1]
    group2 = tmp[tmp.subGroup == 2]

    # check if any of the start dates from group 2 are in group 1
    if len(group1[group1.start.isin(list(group2.start))]) > 0:
        print("Group1 overlaps group2")

    # check if any of the start dates from group 1 are in group 2
    if len(group2[group2.start.isin(list(group1.start))]) > 0:
        print("Group2 overlaps group1")
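Note that `isin` only catches start dates that are exactly equal. If you need a general interval overlap test, one option is to compare every subgroup-1 range against every subgroup-2 range at once with NumPy broadcasting instead of a row-wise `apply`. The following is a minimal sketch, not a drop-in replacement. The name `has_overlap` is made up for illustration, and it assumes half-open ranges `[start, stop)`, so two ranges that merely touch at an endpoint do not count as overlapping (the question's `hasOverlap` treats some touching cases differently).

```python
import pandas as pd

def has_overlap(tmp):
    """Return True if any subGroup-1 range overlaps any subGroup-2 range.

    Hypothetical helper. Assumes half-open ranges [start, stop).
    """
    d1 = tmp[tmp["subGroup"] == 1]
    d2 = tmp[tmp["subGroup"] == 2]

    # Column vectors for subgroup 1, row vectors for subgroup 2, so each
    # comparison broadcasts to a (len(d1), len(d2)) boolean matrix.
    s1 = d1["start"].to_numpy()[:, None]
    e1 = d1["stop"].to_numpy()[:, None]
    s2 = d2["start"].to_numpy()[None, :]
    e2 = d2["stop"].to_numpy()[None, :]

    # Two ranges overlap when each one starts before the other one stops.
    return bool(((s1 < e2) & (s2 < e1)).any())
```

Assuming the full table is in a DataFrame called `df` and grouped by `ID`, it could be applied per group with something like `{key: has_overlap(g) for key, g in df.groupby("ID")}`. The intermediate matrix grows with the product of the two subgroup sizes, so this trades memory for speed on very large groups.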
