Get a new data frame with last rows for each user

I have a big data frame that looks like this:

Id last_item_bought time
'user1' 'bike'  2018-01-01
'user3' 'spoon' 2018-01-01
'user2' 'car'   2018-01-01
'user1' 'spoon' 2018-01-02
'user2' 'bike'  2018-01-02
'user3' 'paper' 2018-01-03

Each user has 0 or 1 row per day.

I want a data frame with one row per user, holding the last item that user bought:

Id last_item_bought
'user1' 'spoon'
'user2' 'bike'
'user3' 'paper'

The data is saved in one file per day, which leads me to two possible starting points:

  • Load all the data into a dask array and then somehow filter out the rows of users who have newer entries.
  • Iterate over the days from the most recent to the oldest, load each day into a pandas DataFrame, and somehow add to the new data frame only those users that do not have newer records (that is, users not yet in the new data frame).


Use sort_values + drop_duplicates:

df = df.sort_values(['Id','time']).drop_duplicates('Id', keep='last')
print (df)
        Id last_item_bought        time
3  'user1'          'spoon'  2018-01-02
4  'user2'           'bike'  2018-01-02
5  'user3'          'paper'  2018-01-03

To drop the time column as well:

df = df.sort_values(['Id','time']).drop_duplicates('Id', keep='last').drop('time', axis=1)
print (df)
        Id last_item_bought
3  'user1'          'spoon'
4  'user2'           'bike'
5  'user3'          'paper'
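
The second idea from the question (walking the daily files from newest to oldest and adding only users not seen yet) could be sketched roughly as follows. This is a minimal sketch, not a tested solution for the real files: the names latest_per_user and daily_frames are hypothetical, loading each day's file is left out because the file format is not given, and it assumes at least one day contains rows.

```python
import pandas as pd

def latest_per_user(daily_frames):
    """Return each user's most recent row.

    daily_frames: iterable of DataFrames, one per day, ordered newest first
    (hypothetical input; how each day's file is loaded is not shown here).
    """
    seen = set()   # users already covered by a newer day
    kept = []      # rows retained from each day
    for day in daily_frames:
        # Keep only users that have not appeared in any newer day.
        fresh = day[~day['Id'].isin(seen)]
        if not fresh.empty:
            kept.append(fresh)
            seen.update(fresh['Id'])
    return pd.concat(kept, ignore_index=True)

# The sample data from the question, split by day and ordered newest first.
days = [
    pd.DataFrame({'Id': ['user3'],
                  'last_item_bought': ['paper'],
                  'time': ['2018-01-03']}),
    pd.DataFrame({'Id': ['user1', 'user2'],
                  'last_item_bought': ['spoon', 'bike'],
                  'time': ['2018-01-02', '2018-01-02']}),
    pd.DataFrame({'Id': ['user1', 'user3', 'user2'],
                  'last_item_bought': ['bike', 'spoon', 'car'],
                  'time': ['2018-01-01', '2018-01-01', '2018-01-01']}),
]

result = latest_per_user(days).sort_values('Id').reset_index(drop=True)
print(result)
```

Because each user has at most one row per day, no sorting inside a day is needed, and only one day's data plus the running result has to be in memory at a time.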

Dask solution (sorting with set_index):

import pandas as pd

df = pd.DataFrame({'Id': ['user1', 'user3', 'user2', 'user1', 'user2', 'user3'],
                   'time': ['2018-01-01', '2018-01-01', '2018-01-01',
                            '2018-01-02', '2018-01-02', '2018-01-03'],
                   'last_item_bought': ['bike', 'spoon', 'car', 'spoon', 'bike', 'paper']})
df['time'] = pd.to_datetime(df['time'])
print (df)
      Id last_item_bought       time
0  user1             bike 2018-01-01
1  user3            spoon 2018-01-01
2  user2              car 2018-01-01
3  user1            spoon 2018-01-02
4  user2             bike 2018-01-02
5  user3            paper 2018-01-03

from dask import dataframe as dd 
ddf = dd.from_pandas(df, npartitions=3)

ddf1 = (ddf.set_index('time')
          .drop_duplicates(subset=['Id'], keep='last')
          .set_index('Id')
          .reset_index()
          .compute())
print (ddf1)
      Id last_item_bought
0  user1            spoon
1  user2             bike
2  user3            paper

Source: https://habr.com/ru/post/1694225/

