The best way to get the latest entries from a pandas DataFrame

Question

Recently I needed to get the most recently set status for certain items, each labeled with an id. I found this answer: Python: How can I get the rows with the maximum value of the group to which they belong?

To my surprise, on a dataset with only ~2e6 rows this was rather slow. However, I do not need all the rows that hold a maximum value, only the latest one per id.

import time

import numpy as np
import pandas as pd

# Random ids and statuses, with dates spread over roughly +/- 5e7 seconds around now
df = pd.DataFrame({
    "id": np.random.randint(1, 1000, size=5000),
    "status": np.random.randint(1, 10, size=5000),
    "date": [
        time.strftime("%Y-%m-%d", time.localtime(time.time() - x))
        for x in np.random.randint(-int(5e7), int(5e7), size=5000)
    ],
})

%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
1 loops, best of 3: 576 ms per loop

%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
100 loops, best of 3: 4.82 ms per loop

The first is the solution I found in the linked answer, which seems to be a general approach that allows more complex operations.

For my problem, however, I can sort, drop duplicates, and reindex, which is much faster. On large datasets this really matters.
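
The sort-and-deduplicate idea can be tried on a tiny frame where the answer is easy to check by eye. This is only a sketch: the sample values and the `latest` name are made up for illustration, and it uses `drop_duplicates(subset=..., keep="last")`, which should select the same rows as the reindex version above in a single call.

```python
import pandas as pd

# Hypothetical sample data: two status updates for id 1, two for id 2.
df = pd.DataFrame({
    "id": [1, 2, 1, 2],
    "status": [5, 7, 3, 9],
    "date": ["2015-01-01", "2015-03-01", "2015-02-01", "2015-01-15"],
})

# Sort so the newest row of each id comes last, then keep only that row.
# ISO-formatted date strings sort chronologically, so no conversion is needed here.
latest = df.sort_values("date").drop_duplicates(subset="id", keep="last")

print(latest.sort_values("id"))
```

If two rows of the same id share the latest date, this keeps just one of them, whereas the groupby/apply version keeps every tied row.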

python numpy pandas

galinden 04 . '15 10:03

jakevdp · Accepted Answer · 2015-11-04T12:16:57+0000

Another option is to use an aggregation on the groupby and then select the matching rows from the full frame.

df.iloc[df.groupby('id')['date'].idxmax()]

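This is about 5-10 times faster than the groupby/apply solution (see the timings below). The timings also show that converting the 'date' column to a datetime type speeds up the sort-based solution: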

# Timing your original solutions:
%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
# 1 loops, best of 3: 826 ms per loop
%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
# 100 loops, best of 3: 5.1 ms per loop

# convert the date
df['date'] = pd.to_datetime(df['date'])

# new times on your solutions
%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
# 1 loops, best of 3: 815 ms per loop
%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
# 1000 loops, best of 3: 1.99 ms per loop

# my aggregation solution
%timeit df.iloc[df.groupby('id')['date'].idxmax()]
# 10 loops, best of 3: 135 ms per loop
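
For a quick sanity check of the aggregation approach, here is a small sketch on made-up data. It assumes the `date` column has already been converted with `pd.to_datetime`, and it uses `df.loc` rather than `df.iloc` because `idxmax` returns index labels; with the default integer index used in the question the two select the same rows. The names `newest_labels`, `by_idxmax` and `by_sort` are illustrative only.

```python
import pandas as pd

# Hypothetical sample data, not the benchmark frame from the question.
df = pd.DataFrame({
    "id": [1, 2, 1, 2],
    "status": [5, 7, 3, 9],
    "date": pd.to_datetime(
        ["2015-01-01", "2015-03-01", "2015-02-01", "2015-01-15"]
    ),
})

# idxmax gives, per id, the index label of the row with the latest date.
newest_labels = df.groupby("id")["date"].idxmax()

# Select those rows by label.
by_idxmax = df.loc[newest_labels]

# The sort-based approach from the question, for comparison.
by_sort = df.reindex(
    df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index
)

# Both should pick the same rows (the row order may differ).
print(by_idxmax.sort_index().equals(by_sort.sort_index()))
```

On a frame whose index is not the default 0..n-1 range, label-based selection with `df.loc` is the safer choice, since positions and labels no longer coincide.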
