How to subset a DataFrame using pandas based on a group criterion?

Question


I have a large dataset with the following structure

    User  X
       1  0
       1  0
       2  0
       2  0
       2  1
       3  0
       3  0

I would like to take a subset of the data so that the sum of the X column for each user is 0. Given the above example, the subset should only include observations for users 1 and 3 as follows

    User  X
       1  0
       1  0
       3  0
       3  0

Is there a way to do this using the groupby function without collapsing the data into groups? I want the subset to keep the individual observations.

+5

python pandas

roland Jan 9 '15 at 19:44


2 answers

As an alternative to @unutbu's answer, there is also

    >>> df.loc[df.groupby("User")["X"].transform(sum) == 0]
       User  X
    0     1  0
    1     1  0
    5     3  0
    6     3  0

This creates a boolean Series of the same length as df, to be used as a selector:

    >>> df.groupby("User")["X"].transform(sum) == 0
    0     True
    1     True
    2    False
    3    False
    4    False
    5     True
    6     True
    dtype: bool

transform is used when you want to "broadcast" the result of a groupby reduction back to all the elements of each group. It comes in handy.
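To make the broadcasting concrete, here is a minimal sketch comparing a plain reduction with transform. The way `df` is constructed is an assumption (the question only shows the data), and the string alias `"sum"`, which pandas also accepts, is used in place of the builtin.

```python
import pandas as pd

# Sample data from the question; the constructor call itself is an assumption.
df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

# A plain reduction returns one value per group, indexed by User.
per_group = df.groupby("User")["X"].sum()

# transform broadcasts each group's sum back onto that group's rows,
# so the result shares df's index and can be compared row by row.
per_row = df.groupby("User")["X"].transform("sum")

mask = per_row == 0      # True for every row of users 1 and 3
subset = df.loc[mask]
print(subset)
```

Because `per_row` is aligned with `df`, the mask selects the original rows without any grouping of the output.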

+8

DSM Jan 9 '15 at 20:04


unutbu · Accepted Answer · 2015-01-09T19:52:42+0000

DSM's answer, which selects rows using a boolean mask, works well even if the DataFrame has a non-unique index. My method, which selects rows using index values, is slightly slower when the index is unique and significantly slower when the index contains duplicate values.

@roland: Please consider accepting DSM's answer.

You can use groupby-filter:

    In [16]: df.loc[df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
    Out[16]:
       User  X
    0     1  0
    1     1  0
    5     3  0
    6     3  0

The group filter itself simply returns this:

    In [29]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0)
    Out[29]:
    0    0
    1    0
    5    0
    6    0
    Name: X, dtype: int64

but you can use its index,

    In [30]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index
    Out[30]: Int64Index([0, 1, 5, 6], dtype='int64')

to select the desired rows using df.loc.
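As a possible shortcut, filter can also be called on the grouped DataFrame rather than on the single column X, in which case it returns the matching rows directly and the .index / df.loc round trip is not needed. A minimal sketch, assuming the sample data from the question (the construction of `df` and the names `via_index` and `via_filter` are mine, not from the original session):

```python
import pandas as pd

# Sample data from the question; the constructor call itself is an assumption.
df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

# Column-level filter: only the X values survive, so its index is fed to df.loc.
idx = df.groupby("User")["X"].filter(lambda x: x.sum() == 0).index
via_index = df.loc[idx]

# Frame-level filter: the lambda receives each group's sub-DataFrame, and
# whole rows are kept for the groups where it returns True.
via_filter = df.groupby("User").filter(lambda g: g["X"].sum() == 0)

print(via_filter)
```

Both routes give the same rows here; the frame-level call simply skips the label lookup.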

Here is the benchmark that I used:

 In [49]: df2 = pd.concat([df]*10000) # df2 has a non-unique index

I Ctrl-C'd this one because it was taking too long to finish:

 In [50]: %timeit df2.loc[df2.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]

When I realized my mistake, I created a DataFrame with a unique index:

    In [51]: df3 = df2.reset_index() # this gives df3 a unique index

    In [52]: %timeit df3.loc[df3.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
    100 loops, best of 3: 13 ms per loop

    In [53]: %timeit df3.loc[df3.groupby("User")["X"].transform(sum) == 0]
    100 loops, best of 3: 11.4 ms per loop

This shows that DSM's method works well even with a non-unique index:

    In [54]: %timeit df2.loc[df2.groupby("User")["X"].transform(sum) == 0]
    100 loops, best of 3: 11.2 ms per loop
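One likely reason the index-based selection struggles with duplicate labels: df.loc with a list of labels returns every row carrying each label, so on a concatenated frame the lookup does much more work and also hands back repeated rows. A small sketch, using 3 copies instead of 10000 to keep it quick (the construction of `df` and the name `copies` are my own, not part of the benchmark above):

```python
import pandas as pd

# Sample data from the question; the constructor call itself is an assumption.
df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

copies = 3
df2 = pd.concat([df] * copies)   # index labels 0..6 each appear `copies` times

# Boolean mask: applied row by row, so duplicate labels do not matter.
by_mask = df2.loc[df2.groupby("User")["X"].transform("sum") == 0]

# Index labels: the filter keeps 4 * copies labels, and each label
# matches `copies` rows in df2 when passed to .loc.
idx = df2.groupby("User")["X"].filter(lambda x: x.sum() == 0).index
by_index = df2.loc[idx]

print(len(by_mask), len(by_index))
```

With a unique index (as in df3) each label matches exactly one row, and the two selections agree.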

