How to select a subset of a pandas DataFrame based on group criteria?

I have a large dataset with the following structure:

 User  X
 1     0
 1     0
 2     0
 2     0
 2     1
 3     0
 3     0

I would like to take a subset of the data so that the sum of the X column for each user is 0. Given the above example, the subset should only include observations for users 1 and 3, as follows:

 User  X
 1     0
 1     0
 3     0
 3     0

Is there a way to do this using the groupby function without aggregating the data? I want the subset to keep the individual observations.

2 answers

DSM's answer, which selects rows using a boolean mask, works well even if the DataFrame has a non-unique index. My method, which selects rows using index values, is slightly slower when the index is unique and significantly slower when the index contains duplicate values.

@roland: Please consider accepting DSM's answer.


You can use groupby-filter:

 In [16]: df.loc[df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
 Out[16]:
    User  X
 0     1  0
 1     1  0
 5     3  0
 6     3  0

The group filter itself simply returns this:

 In [29]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0)
 Out[29]:
 0    0
 1    0
 5    0
 6    0
 Name: X, dtype: int64

but you can use its index,

 In [30]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index Out[30]: Int64Index([0, 1, 5, 6], dtype='int64') 

to select the desired rows using df.loc.
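For readers who want to reproduce the steps above, here is a minimal self-contained sketch. The DataFrame is rebuilt from the table in the question; the variable names `kept` and `subset` are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question (default unique index 0..6).
df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

# Keep only the X values that belong to groups whose sum is 0.
kept = df.groupby("User")["X"].filter(lambda x: x.sum() == 0)

# The index labels of the surviving values select the full rows.
subset = df.loc[kept.index]
print(subset)
```

Because the selection goes through index labels, this sketch relies on `df` having a unique index, as it does here.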


Here is the benchmark that I used:

 In [49]: df2 = pd.concat([df]*10000) # df2 has a non-unique index 

I Ctrl-C'd this one because it was taking too long to finish:

 In [50]: %timeit df2.loc[df2.groupby('User')['X'].filter(lambda x: x.sum() == 0).index] 

When I realized my mistake, I created a DataFrame with a unique index:

 In [51]: df3 = df2.reset_index() # this gives df3 a unique index
 In [52]: %timeit df3.loc[df3.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
 100 loops, best of 3: 13 ms per loop
 In [53]: %timeit df3.loc[df3.groupby("User")["X"].transform(sum) == 0]
 100 loops, best of 3: 11.4 ms per loop

This shows that DSM's method works well even with a non-unique index:

 In [54]: %timeit df2.loc[df2.groupby("User")["X"].transform(sum) == 0]
 100 loops, best of 3: 11.2 ms per loop
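Outside IPython, a comparison along the same lines could be sketched with the standard `timeit` module. This is only an approximation of the benchmark above: the helper names `by_filter` and `by_mask` are made up for illustration, `ignore_index=True` is used here (instead of `reset_index()`) to get a unique index, and the slow non-unique-index case is deliberately left out. Timings will differ by machine and pandas version.

```python
import timeit
import pandas as pd

df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

# ignore_index=True gives the enlarged frame a unique index.
big = pd.concat([df] * 10000, ignore_index=True)

def by_filter(frame):
    # Select rows through the index labels returned by groupby-filter.
    kept = frame.groupby("User")["X"].filter(lambda x: x.sum() == 0)
    return frame.loc[kept.index]

def by_mask(frame):
    # Select rows through a boolean mask built with transform.
    return frame.loc[frame.groupby("User")["X"].transform("sum") == 0]

for fn in (by_filter, by_mask):
    seconds = timeit.timeit(lambda: fn(big), number=10) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per call")
```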

As an alternative to @unutbu's answer, there is also

 >>> df.loc[df.groupby("User")["X"].transform(sum) == 0]
    User  X
 0     1  0
 1     1  0
 5     3  0
 6     3  0

This creates a boolean Series with the same length as df, for use as a selector:

 >>> df.groupby("User")["X"].transform(sum) == 0
 0     True
 1     True
 2    False
 3    False
 4    False
 5     True
 6     True
 dtype: bool

transform is used when you want to "broadcast" the result of a group-wise reduction back to all elements of each group. It comes in handy.
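To make the broadcasting step visible, here is a small sketch that separates the per-group sums from the mask. The data is rebuilt from the question, and the names `group_sums`, `mask`, and `subset` are only illustrative. The string `"sum"` is passed instead of the builtin `sum` used above; the two should give the same result here, and some recent pandas versions emit a warning for the builtin.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

# Each row receives the sum of X over its own User group.
group_sums = df.groupby("User")["X"].transform("sum")

# Rows whose group sums to 0 are kept; the mask aligns row by row with df.
mask = group_sums == 0
subset = df[mask]
print(subset)
```

Since the mask has one entry per row of `df`, this approach does not depend on the index labels being unique.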


Source: https://habr.com/ru/post/1210762/

