DSM's answer, which selects rows using a boolean mask, works well even if the DataFrame has a non-unique index. My method, which selects rows using index values, is slightly slower when the index is unique and significantly slower when the index contains duplicate values.
@roland: Please consider accepting DSM's answer.
You can use groupby-filter:
In [16]: df.loc[df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
Out[16]:
   User  X
0     1  0
1     1  0
5     3  0
6     3  0
The groupby-filter on its own returns only the X values of the matching groups:
In [29]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0)
Out[29]:
0    0
1    0
5    0
6    0
Name: X, dtype: int64
but you can use its index,
In [30]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index
Out[30]: Int64Index([0, 1, 5, 6], dtype='int64')
to select the desired rows using df.loc.
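For a self-contained run of the steps above, here is a minimal sketch. The original `df` is not shown in this answer, so the frame below is an assumption, chosen only so that it reproduces the output displayed above.

```python
import pandas as pd

# Hypothetical data: the original df is not shown, so these values are
# assumed; they are picked to match the output shown above.
df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

# Keep only the X values of users whose X sums to zero.
kept = df.groupby("User")["X"].filter(lambda x: x.sum() == 0)

# The filtered Series keeps the original row labels, so its index can be
# passed to df.loc to recover the full rows.
result = df.loc[kept.index]
print(result)
```

Here `kept` and `result` are illustrative names, not part of the original session.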
Here is the benchmark I used. Note that df2 has a non-unique index, because concatenating copies of df repeats its row labels:
In [49]: df2 = pd.concat([df]*10000)
I Ctrl-C'd this one because it was taking too long to finish:
In [50]: %timeit df2.loc[df2.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
When I realized my mistake, I created a DataFrame with a unique index:
In [51]: df3 = df2.reset_index()
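As a rough way to repeat the unique-index comparison outside IPython, the sketch below uses the standard `timeit` module. The data, the number of copies, and the repeat count are assumptions (the original `df` is not shown), and no particular timings are claimed, since results depend on the machine and the pandas version.

```python
import timeit
import pandas as pd

# Hypothetical stand-in for the original df, which is not shown.
df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

# Fewer copies than the 10000 used above, to keep the sketch quick.
df3 = pd.concat([df] * 1000).reset_index()   # reset_index gives a unique index
assert df3.index.is_unique

def by_index():
    # Select rows through the index labels of the filtered Series.
    kept = df3.groupby("User")["X"].filter(lambda x: x.sum() == 0)
    return df3.loc[kept.index]

def by_mask():
    # Select rows through a boolean mask (DSM's approach).
    return df3.loc[df3.groupby("User")["X"].transform("sum") == 0]

# With a unique index both approaches pick exactly the same rows.
assert by_index().equals(by_mask())

for func in (by_index, by_mask):
    seconds = timeit.timeit(func, number=10) / 10
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```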
The following timing shows that DSM's method performs well even with a non-unique index:
In [54]: %timeit df2.loc[df2.groupby("User")["X"].transform(sum) == 0]
100 loops, best of 3: 11.2 ms per loop
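One likely reason the index-based selection struggles on df2 can be seen on a much smaller frame: with a non-unique index, each label handed to `df.loc` matches every row carrying that label, so the selection returns more rows than the boolean mask does. The sketch below assumes a stand-in `df` (the original is not shown) and only three copies.

```python
import pandas as pd

# Hypothetical stand-in for the original df, which is not shown.
df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

small = pd.concat([df] * 3)          # labels 0..6 each appear three times
assert not small.index.is_unique

# Boolean mask: one True/False per row, so each matching row is returned once.
mask = small.groupby("User")["X"].transform("sum") == 0
by_mask = small.loc[mask]

# Index labels: the filtered index holds each of 0, 1, 5, 6 three times, and
# every label lookup matches all rows sharing that label.
labels = small.groupby("User")["X"].filter(lambda x: x.sum() == 0).index
by_index = small.loc[labels]

print(len(by_mask), len(by_index))
```

The gap between the two lengths grows with the number of copies, which is consistent with the label-based call on df2 having to be interrupted.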