Pandas multiple line date check

I need to run a function in a large groupby query that checks if the two subgroups have any overlapping dates. The following is an example of one tmp group:

  ID num start stop subGroup 0 21 10 2006-10-10 2008-10-03 1 1 21 46 2006-10-10 2100-01-01 2 2 21 5 1997-11-25 1998-09-29 1 3 21 42 1998-09-29 2100-01-01 2 4 21 3 1997-01-07 1997-11-25 1 5 21 6 2006-10-10 2008-10-03 1 6 21 47 1998-09-29 2006-10-10 2 7 21 4 1997-01-07 1998-09-29 1 

The function I wrote for this is as follows:

 def hasOverlap(tmp): d2_starts = tmp[tmp['subGroup']==2]['start'] d2_stops = tmp[tmp['subGroup']==2]['stop'] return tmp[tmp['subGroup']==1].apply(lambda row_d1: ( #Check for part nested D2 in D1 ((d2_starts >= row_d1['start']) & (d2_starts < row_d1['stop']) ) | ((d2_stops >= row_d1['start']) & (d2_stops < row_d1['stop']) ) | #Check for fully nested D1 in D2 ((d2_stops >= row_d1['stop']) & (d2_starts <= row_d1['start']) ) ).any() ,axis = 1 ).any() 

The problem is that this code has a lot of redundancy and when I run the query:

 groups.agg(hasOverlap) 

It takes an unreasonably long time to complete.

Are there any performance fixes (e.g. using built-in functions or set_index) that I could do to speed this up?

+3
source share
1 answer

Do you just want to return "True" or "False" depending on the presence of overlap? If so, I would just get a list of dates for each subgroup and then use the pandas isin method to check if they overlap.

You can try something like this:

 #split subgroups into separate DF's group1 = groups[groups.subgroup==1] group2 = groups[groups.subgroup==2] #check if any of the start dates from group 2 are in group 1 if len(group1[group1.start.isin(list(group2.start))]) >0: print "Group1 overlaps group2" #check if any of the start dates from group 1 are in group 2 if len(group2[group2.start.isin(list(group1.start))]) >0: print "Group2 overlaps group1" 
0
source

Source: https://habr.com/ru/post/1270745/


All Articles