I have some data that I want to group by a specific column, and then aggregate a series of fields based on rolling time windows within each group.
Here is some sample data:
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([Row(date='2016-01-01', group_by='group1', get_avg=5, get_first=1),
                            Row(date='2016-01-10', group_by='group1', get_avg=5, get_first=2),
                            Row(date='2016-02-01', group_by='group2', get_avg=10, get_first=3),
                            Row(date='2016-02-28', group_by='group2', get_avg=20, get_first=3),
                            Row(date='2016-02-29', group_by='group2', get_avg=30, get_first=3),
                            Row(date='2016-04-02', group_by='group2', get_avg=8, get_first=4)])
I want to group by group_by, and then create time windows for each group that start at the group's earliest date and last 30 days. Once those 30 days are over, the next time window starts at the date of the next row that did not fall into the previous window.
Then I want to aggregate within each window, for example to get the average of get_avg and the first value of get_first.
So the output for this example should be:
group_by   first date of window   get_avg   get_first
group1     2016-01-01             5         1
group2     2016-02-01             20        3
group2     2016-04-02             8         4
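To make the intended logic concrete, here is a minimal sketch of what I mean, assuming Spark 3.x where grouped applyInPandas is available (sessionize and window_start are just illustrative names, not from any existing code). I am not sure this is the idiomatic way to do it, hence the question:

import pandas as pd

def sessionize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Walk the group's rows in date order, opening a new window whenever
    # a row falls more than 30 days after the current window's start.
    pdf = pdf.sort_values('date')
    dates = pd.to_datetime(pdf['date'])
    window_start = None
    starts = []
    for d in dates:
        if window_start is None or (d - window_start).days > 30:
            window_start = d
        starts.append(window_start.strftime('%Y-%m-%d'))
    pdf = pdf.assign(window_start=starts)
    # Aggregate per window: mean of get_avg, first (date-ordered) get_first.
    return (pdf.groupby(['group_by', 'window_start'], as_index=False)
               .agg(get_avg=('get_avg', 'mean'),
                    get_first=('get_first', 'first')))

result = df.groupBy('group_by').applyInPandas(
    sessionize,
    schema='group_by string, window_start string, get_avg double, get_first long')
result.orderBy('group_by', 'window_start').show()

On the sample data this produces the table above: group2's rows in February all fall within 30 days of 2016-02-01 and average to 20, while 2016-04-02 opens a new window.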