Apply a function to a MultiIndex DataFrame using pandas / Python

I have the following DataFrame, to which I want to apply some date-range calculations. I want to select rows where, for each unique individual, the difference between sample dates (from sample_date) is less than 8 weeks, and keep the row with the oldest date (i.e. the first sample).

Here is an example dataset. The actual data set may exceed 200,000 records.

    labno  name    sex  dob         id     location  sample_date
    1      John A  M    12/07/1969  12345  A         12/05/2112
    2      John B  M    10/01/1964  54321  B         6/12/2010
    3      James   M    30/08/1958  87878  A         30/04/2012
    4      James   M    30/08/1958  45454  B         29/04/2012
    5      Peter   M    12/05/1935  33322  C         15/07/2011
    6      John A  M    12/07/1969  12345  A         14/05/2012
    7      Peter   M    12/05/1935  33322  A         23/03/2011
    8      Jack    M    5/12/1921   65655  B         15/08/2011
    9      Jill    F    6/08/1986   65459  A         16/02/2012
    10     Julie   F    4/03/1992   41211  C         15/09/2011
    11     Angela  F    1/10/1977   12345  A         23/10/2006
    12     Mark A  M    1/06/1955   56465  C         4/04/2011
    13     Mark A  M    1/06/1955   45456  C         3/04/2011
    14     Mark B  M    9/12/1984   55544  A         13/09/2012
    15     Mark B  M    9/12/1984   55544  A         1/01/2012

Unique people are those who have the same name and dob. For example, John A, James, Mark A and Mark B are unique people. However, Mark A has different id values.

I usually use R for this: I generate a list of data frames based on the name / dob combination and sort each data frame by sample_date. I then use a list apply function to check the difference between the dates at the first and last index of each data frame, and return the oldest row if it is less than 8 weeks from the most recent date. It takes forever.

I would welcome a few pointers on how to attempt this with Python / pandas. I started by creating a MultiIndex on name / dob / id. The structure looks the way I want. Now I need to apply some of the functions I use in R to select the rows I need. I tried selecting with df.xs(), but I did not get very far.
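For context, the MultiIndex setup and df.xs() selection described above might look roughly like this. The three-row frame is a made-up excerpt using the column names from the example, not the real data, and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical three-row excerpt; column names follow the example dataset.
df = pd.DataFrame({
    "name": ["Mark A", "Mark A", "Jill"],
    "dob": ["1/06/1955", "1/06/1955", "6/08/1986"],
    "id": [56465, 45456, 65459],
    "sample_date": ["4/04/2011", "3/04/2011", "16/02/2012"],
})

# Build the name / dob / id MultiIndex and sort it so lookups are efficient.
indexed = df.set_index(["name", "dob", "id"]).sort_index()

# Cross-section: every row for one person (name + dob), whatever the id.
mark_a = indexed.xs(("Mark A", "1/06/1955"), level=["name", "dob"])
print(mark_a)
```

The cross-section is indexed by the remaining level (id), which is why both of Mark A's id values show up for the same person.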

Here is a data dictionary that loads easily into pandas (albeit with a different column order).

    {'dob': {0: '12/07/1969', 1: '10/01/1964', 2: '30/08/1958', 3: '30/08/1958', 4: '12/05/1935', 5: '12/07/1969', 6: '12/05/1935', 7: '5/12/1921', 8: '6/08/1986', 9: '4/03/1992', 10: '1/10/1977', 11: '1/06/1955', 12: '1/06/1955', 13: '9/12/1984', 14: '9/12/1984'},
     'id': {0: 12345, 1: 54321, 2: 87878, 3: 45454, 4: 33322, 5: 12345, 6: 33322, 7: 65655, 8: 65459, 9: 41211, 10: 12345, 11: 56465, 12: 45456, 13: 55544, 14: 55544},
     'labno': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
     'location': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'C', 5: 'A', 6: 'A', 7: 'B', 8: 'A', 9: 'C', 10: 'A', 11: 'C', 12: 'C', 13: 'A', 14: 'A'},
     'name': {0: 'John A', 1: 'John B', 2: 'James', 3: 'James', 4: 'Peter', 5: 'John A', 6: 'Peter', 7: 'Jack', 8: 'Jill', 9: 'Julie', 10: 'Angela', 11: 'Mark A', 12: 'Mark A', 13: 'Mark B', 14: 'Mark B'},
     'sample_date': {0: '12/05/2112', 1: '6/12/2010', 2: '30/04/2012', 3: '29/04/2012', 4: '15/07/2011', 5: '14/05/2012', 6: '23/03/2011', 7: '15/08/2011', 8: '16/02/2012', 9: '15/09/2011', 10: '23/10/2006', 11: '4/04/2011', 12: '3/04/2011', 13: '13/09/2012', 14: '1/01/2012'},
     'sex': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'M', 5: 'M', 6: 'M', 7: 'M', 8: 'F', 9: 'F', 10: 'F', 11: 'M', 12: 'M', 13: 'M', 14: 'M'}}
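A minimal sketch of loading a dictionary in this layout and turning the date strings into real datetimes, which any date arithmetic needs. It uses only the two Mark B records to stay short; the day/month/year format string is an assumption based on values such as 13/09/2012.

```python
import pandas as pd

# Two-record excerpt in the same {column: {row: value}} layout as above.
data = {
    "name": {13: "Mark B", 14: "Mark B"},
    "dob": {13: "9/12/1984", 14: "9/12/1984"},
    "sample_date": {13: "13/09/2012", 14: "1/01/2012"},
}

df = pd.DataFrame(data)

# Assumption: dates are day/month/year. Strip stray spaces before parsing.
for col in ["dob", "sample_date"]:
    df[col] = pd.to_datetime(df[col].str.strip(), format="%d/%m/%Y")

# Oldest sample first within each person.
df = df.sort_values(["name", "dob", "sample_date"])
print(df.dtypes)
```

Giving the format explicitly avoids pandas guessing month-first for ambiguous values such as 4/04/2011.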

1 answer

I think you might be looking for

    import numpy as np

    def differ(df):
        # assumes sample_date has already been parsed to datetime64
        delta = df.sample_date.diff().abs()  # only care about magnitude
        cond = delta.notnull() & (delta < np.timedelta64(8, 'W'))
        return df[cond].max()

    delta = df.groupby(['dob', 'name']).apply(differ)

Depending on whether you want to keep people who do not have more than one sample, you can call delta.dropna(how='all') to remove them.
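To illustrate what how='all' does, here is a small sketch on a hand-built frame shaped like a per-person result. The values are hypothetical and are not output from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical per-person result: the second row is entirely NaN, as it would
# be for someone with no pair of samples inside the 8-week window.
result = pd.DataFrame(
    {"labno": [4.0, np.nan], "location": ["B", np.nan]},
    index=pd.Index(["James", "Jill"], name="name"),
)

# how='all' drops a row only when every column in it is missing.
kept = result.dropna(how="all")
print(kept)
```

A row with just one missing column would survive, which is the difference from the default how='any'.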

Note that you will need numpy >= 1.7 for the timedelta64 comparison to work, as there are a number of issues with timedelta64 / datetime64 in numpy < 1.7.
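As a variation (not the code above), here is a sketch that keeps the oldest row per person when all of that person's samples fall within 8 weeks, which is how I read the question. It assumes a reasonably recent pandas, uses pd.Timedelta for the threshold instead of np.timedelta64, and runs on a made-up five-row excerpt.

```python
import pandas as pd

# Hypothetical excerpt; sample_date is parsed as day/month/year.
df = pd.DataFrame({
    "name": ["James", "James", "Mark B", "Mark B", "Jill"],
    "dob": ["30/08/1958", "30/08/1958", "9/12/1984", "9/12/1984", "6/08/1986"],
    "labno": [3, 4, 14, 15, 9],
    "sample_date": pd.to_datetime(
        ["30/04/2012", "29/04/2012", "13/09/2012", "1/01/2012", "16/02/2012"],
        format="%d/%m/%Y",
    ),
})

# Per-person span between first and last sample, broadcast back to each row.
grouped = df.groupby(["name", "dob"])["sample_date"]
span = grouped.transform("max") - grouped.transform("min")
n_samples = grouped.transform("count")

# Assumption: a person qualifies with 2+ samples spanning less than 8 weeks.
close = (n_samples > 1) & (span < pd.Timedelta(weeks=8))

# Of the qualifying rows, keep the earliest sample for each person.
oldest = (
    df[close]
    .sort_values("sample_date")
    .drop_duplicates(["name", "dob"], keep="first")
)
print(oldest)
```

This avoids a Python-level function call per group, which may matter with 200,000+ records, but I have not timed it against the apply version.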


Source: https://habr.com/ru/post/951436/

