Multiple results for each individual row (one to many) with Pandas

If I have a DataFrame where each row is an individual and each column is a separate attribute, how can I get a new DataFrame that maps each person to multiple results?

I tried to do this with DataFrame.apply(), which seemed the most intuitive approach, but it raises exceptions, as in the example below. Adding broadcast=False or reduce=False does not help.

The following is obviously a trivial example, but consider any scenario in which each row maps to multiple rows. What is the best way to handle this? In practice, each row could map to a different set of results. This is basically a one-to-many calculation.

Example: I have a DataFrame with the following structure, and for each person I want to get three upcoming birthdays (a trivial example, I know). So from:

+---+-------+------------+
|   | name  | birthdate  |
+---+-------+------------+
| 1 | John  | 1990-01-01 |
| 2 | Jane  | 1957-04-03 |
| 3 | Max   | 1987-02-03 |
| 4 | David | 1964-02-12 |
+---+-------+------------+

to something like:

+-------+------------+
| name  | birthday   |
+-------+------------+
| John  | 2016-01-01 |
| John  | 2017-01-01 |
| John  | 2018-01-01 |
| Jane  | 2016-04-03 |
| Jane  | 2017-04-03 |
| Jane  | 2018-04-03 |
| Max   | 2016-02-03 |
| Max   | 2017-02-03 |
| Max   | 2018-02-03 |
| David | 2016-02-12 |
| David | 2017-02-12 |
| David | 2018-02-12 |
+-------+------------+

Intuitively, I would try to do something like this:

def get_birthdays(person):
    birthdays = []
    for year in range(2016, 2019):
        birthdays.append({
            # person.name is the row's index label; use person['name'] for the column
            'name': person['name'],
            'birthday': person['birthdate'].replace(year=year)
        })

    return pd.DataFrame(birthdays)

# with data as my original DataFrame
data.apply(get_birthdays, axis=1)

However, this raises:

ValueError: could not broadcast input array from shape (3,2) into shape (3)

During handling of the above exception, another exception occurred:

[...]

ValueError: cannot copy sequence with size 2 to array axis with dimension 3
1 answer

The groupby version of apply supports a DataFrame as the return value, in the way you intended:

import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Max', 'David'],
    'birthdate': [datetime(1990,1,1), datetime(1957,4,3), datetime(1987,2,3), datetime(1964,2,12)],
})

def get_birthdays(df_x):
    # df_x is the sub-DataFrame for one group (here: one person)
    d = {'name': [], 'birthday': []}
    name = df_x.iloc[0]['name']
    original = df_x.iloc[0]['birthdate']
    for year in range(2016, 2019):
        d['name'].append(name)
        d['birthday'].append(original.replace(year=year))
    return pd.DataFrame(d)

# group_keys=False keeps the group label out of the result index
print(df.groupby('name', group_keys=False).apply(get_birthdays).reset_index(drop=True))

Output:

     birthday   name
0  2016-02-12  David
1  2017-02-12  David
2  2018-02-12  David
3  2016-04-03   Jane
4  2017-04-03   Jane
5  2018-04-03   Jane
6  2016-01-01   John
7  2017-01-01   John
8  2018-01-01   John
9  2016-02-03    Max
10 2017-02-03    Max
11 2018-02-03    Max
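
As an alternative sketch (not part of the original answer), the same one-to-many expansion can be written without groupby: build a list of results per row, then use DataFrame.explode to turn each list element into its own row. This assumes pandas 0.25 or later, where explode is available. The helper name upcoming_birthdays and its years parameter are made up for illustration. Because each row's list can have a different length, this also covers the case where rows map to different numbers of results.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Max', 'David'],
    'birthdate': pd.to_datetime(
        ['1990-01-01', '1957-04-03', '1987-02-03', '1964-02-12']
    ),
})

def upcoming_birthdays(birthdate, years=range(2016, 2019)):
    # Hypothetical helper: returns one birthday per requested year as a list.
    return [birthdate.replace(year=year) for year in years]

result = (
    # Series.apply stores each returned list as a single cell value
    df.assign(birthday=df['birthdate'].apply(upcoming_birthdays))
      # explode emits one row per list element, repeating the other columns
      .explode('birthday')
      .drop(columns='birthdate')
      .reset_index(drop=True)
)

print(result)
```

Unlike the groupby version, this keeps the original row order (John, Jane, Max, David) and does not require the names to be unique.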

Source: https://habr.com/ru/post/1606604/

