Pandas: keep rows if at least one of them contains a specific value

I have the following DataFrame in pandas:

letter  number
------ -------
a       2
a       0
b       1
b       5
b       2
c       1
c       0
c       2

I would like to keep all rows for a letter if at least one of its numbers is 0. The result would be:

letter  number
------ -------
a       2
a       0
c       1
c       0
c       2

since b does not have a corresponding number equal to 0.

What is the best way to do this? Thanks!

2 answers

You need groupby with filter:

df = df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
print (df)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
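
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt from the table in the question, and the name `kept` is only an illustrative choice, not something from the original post:

```python
import pandas as pd

# Rebuild the example data from the question.
df = pd.DataFrame({
    "letter": list("aabbbccc"),
    "number": [2, 0, 1, 5, 2, 1, 0, 2],
})

# Keep every group (letter) in which at least one number equals 0.
kept = df.groupby("letter").filter(lambda g: (g["number"] == 0).any())
print(kept)
```

The lambda is called once per group and must return a single boolean; rows of groups that return False are dropped, and the original index is preserved.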

Another solution uses transform to count the rows equal to 0 in each group, then filters with boolean indexing:

print (df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()))
0    1
1    1
2    0
3    0
4    0
5    1
6    1
7    1
Name: number, dtype: int64

df = df[df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()) > 0]
print (df)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
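
A possible variant of the same idea, not taken from the answer above: compute the boolean condition first and broadcast a per-group `any` back to every row with `transform("any")`, which avoids the Python-level lambda. The names `mask` and `kept` are assumptions made for this sketch:

```python
import pandas as pd

# Rebuild the example data from the question.
df = pd.DataFrame({
    "letter": list("aabbbccc"),
    "number": [2, 0, 1, 5, 2, 1, 0, 2],
})

# True where number == 0, then "does any row in this letter's group match?"
# is broadcast back to every row of that group.
mask = df["number"].eq(0).groupby(df["letter"]).transform("any")

# Boolean indexing keeps only rows whose group contains a 0.
kept = df[mask]
print(kept)
```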

EDIT:

It is faster not to use groupby at all; better is loc with isin:

df1 = df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
print (df1)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
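
The loc + isin approach can be wrapped in a small reusable function. This is only a sketch: the helper name `keep_groups_with_value` and its parameters are hypothetical and not part of pandas or of the original answer:

```python
import pandas as pd


def keep_groups_with_value(frame, group_col, value_col, value):
    """Keep all rows whose group has at least one row where value_col == value.

    Hypothetical helper for illustration only.
    """
    # Group labels that have at least one matching row.
    groups = frame.loc[frame[value_col] == value, group_col]
    # Keep every row belonging to one of those groups.
    return frame[frame[group_col].isin(groups)]


# Rebuild the example data from the question.
df = pd.DataFrame({
    "letter": list("aabbbccc"),
    "number": [2, 0, 1, 5, 2, 1, 0, 2],
})

df1 = keep_groups_with_value(df, "letter", "number", 0)
print(df1)
```

If no row matches the value, the selection of group labels is empty and the function returns an empty DataFrame with the same columns.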

Comparison with another solution:

In [412]: %timeit df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 815 µs per loop

In [413]: %timeit df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
1000 loops, best of 3: 657 µs per loop

As an alternative to groupby, you can use isin. For example:

>>> letters_to_keep = df[df['number'] == 0]['letter']
>>> df_reduced = df[df['letter'].isin(letters_to_keep)]
>>> df_reduced
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2

Compared with groupby, this is faster! Timing both with timeit:

>>> %%timeit
... df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
100 loops, best of 3: 2.26 ms per loop

>>> %%timeit
... df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 820 µs per loop
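
The timings above are for the 8-row example. To check how the two approaches compare on your own data size, something like the following sketch could be used outside IPython. The synthetic data, sizes, and function names are all assumptions made for illustration, and no particular result is implied:

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption for this sketch): 20,000 rows, ~1,000 groups.
rng = np.random.default_rng(0)
n = 20_000
big = pd.DataFrame({
    "letter": rng.integers(0, 1_000, size=n).astype(str),
    "number": rng.integers(0, 50, size=n),
})


def with_filter(frame):
    # groupby + filter: calls the lambda once per group.
    return frame.groupby("letter").filter(lambda g: (g["number"] == 0).any())


def with_isin(frame):
    # isin: one vectorised membership test, no groupby.
    return frame[frame["letter"].isin(frame.loc[frame["number"] == 0, "letter"])]


# Both approaches should select exactly the same rows.
same_rows = with_filter(big).index.equals(with_isin(big).index)
print("same rows selected:", same_rows)

# Time a few runs of each; absolute numbers depend on machine and pandas version.
t_filter = timeit.timeit(lambda: with_filter(big), number=3) / 3
t_isin = timeit.timeit(lambda: with_isin(big), number=3) / 3
print(f"groupby.filter: {t_filter * 1e3:.2f} ms per run")
print(f"isin:           {t_isin * 1e3:.2f} ms per run")
```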

Source: https://habr.com/ru/post/1673886/
