Pandas: get the count of a single value in a DataFrame column

Using pandas, I would like to get the count of a specific value in a column. I know that df.somecolumn.value_counts() gives me all the unique values and their counts. But how do I get the count of one specific value?

    In [5]: df
    Out[5]:
    col
      1
      1
      1
      1
      2
      2
      2
      1

Desired:

To get the count of 1:

    In [6]: df.somecalulation(1)
    Out[6]: 5

To get the count of 2:

    In [6]: df.somecalulation(2)
    Out[6]: 3
2 answers

You can try value_counts:

    # Store the result under a new name so the original df stays intact
    counts = df['col'].value_counts().reset_index()
    counts.columns = ['col', 'count']
    print(counts)
       col  count
    0    1      5
    1    2      3

EDIT:

    # Compare the column to the value, then sum the boolean mask
    print((df['col'] == 1).sum())
    5

Or:

    def somecalulation(x):
        # Count the rows where 'col' equals x
        return (df['col'] == x).sum()

    print(somecalulation(1))
    5
    print(somecalulation(2))
    3

Or:

    # Compute all counts once, then look up individual values
    ser = df['col'].value_counts()

    def somecalulation(s, x):
        return s[x]

    print(somecalulation(ser, 1))
    5
    print(somecalulation(ser, 2))
    3
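The variants above can be combined into one self-contained sketch. It rebuilds the data from the question and also covers a value that does not occur in the column, where `s[x]` would raise a `KeyError`. The helper names `count_value` and `count_value_cached` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Data from the question: five 1s and three 2s.
df = pd.DataFrame({"col": [1, 1, 1, 1, 2, 2, 2, 1]})


def count_value(frame, column, value):
    """Count rows where frame[column] equals value (hypothetical helper)."""
    return int((frame[column] == value).sum())


def count_value_cached(counts, value):
    """Look up value in a precomputed value_counts() Series, defaulting to 0."""
    return int(counts.get(value, 0))


counts = df["col"].value_counts()

print(count_value(df, "col", 1))      # 5
print(count_value(df, "col", 2))      # 3
print(count_value_cached(counts, 1))  # 5
print(count_value_cached(counts, 3))  # 0, whereas counts[3] would raise KeyError
```

The boolean-mask version already returns 0 for a missing value; only the lookup in the precomputed Series needs the `.get(value, 0)` fallback.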

EDIT2:

If you need something really fast, use numpy.in1d (deprecated since NumPy 2.0 in favor of numpy.isin):

    import pandas as pd
    import numpy as np

    a = pd.Series([1, 1, 1, 1, 2, 2])
    # for testing with len(a) = 6000
    a = pd.concat([a]*1000).reset_index(drop=True)

    print(np.in1d(a, 1).sum())
    4000
    print((a == 1).sum())
    4000
    print(np.sum(a == 1))
    4000

Timings

len(a) = 6:

    In [131]: %timeit np.in1d(a,1).sum()
    The slowest run took 9.17 times longer than the fastest. This could mean that an intermediate result is being cached
    10000 loops, best of 3: 29.9 µs per loop

    In [132]: %timeit np.sum(a == 1)
    10000 loops, best of 3: 196 µs per loop

    In [133]: %timeit (a == 1).sum()
    1000 loops, best of 3: 180 µs per loop

len(a) = 6000:

    In [135]: %timeit np.in1d(a,1).sum()
    The slowest run took 7.29 times longer than the fastest. This could mean that an intermediate result is being cached
    10000 loops, best of 3: 48.5 µs per loop

    In [136]: %timeit np.sum(a == 1)
    The slowest run took 5.23 times longer than the fastest. This could mean that an intermediate result is being cached
    1000 loops, best of 3: 273 µs per loop

    In [137]: %timeit (a == 1).sum()
    1000 loops, best of 3: 271 µs per loop
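The timings above come from an IPython session and will differ across machines and library versions. As a rough sketch, they could be re-measured outside IPython with the standard `timeit` module. This version uses `np.isin` in place of `np.in1d`, and the `to_numpy()` variant is an extra candidate added here for comparison, not one of the measurements above.

```python
import timeit

import numpy as np
import pandas as pd

# Same test data as above: 6000 elements, 4000 of them equal to 1.
a = pd.concat([pd.Series([1, 1, 1, 1, 2, 2])] * 1000).reset_index(drop=True)

candidates = {
    "(a == 1).sum()": lambda: (a == 1).sum(),
    "np.sum(a == 1)": lambda: np.sum(a == 1),
    "np.isin(a, 1).sum()": lambda: np.isin(a, 1).sum(),
    "(a.to_numpy() == 1).sum()": lambda: (a.to_numpy() == 1).sum(),
}

results = {}
for label, func in candidates.items():
    # Record the count so the candidates can be checked against each other.
    results[label] = int(func())
    # Best of 3 runs of 1000 calls each, reported per call.
    best = min(timeit.repeat(func, number=1000, repeat=3)) / 1000
    print(f"{label:<28} {best * 1e6:8.1f} µs per call")
```

All candidates should report the same count; only the per-call time is expected to vary.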

If you keep the result of value_counts, you can query it for several values:

    import pandas as pd

    a = pd.Series([1, 1, 1, 1, 2, 2])
    counts = a.value_counts()

    >>> counts[1], counts[2]
    (4, 2)

However, to count only one element, it would be faster to use:

    import numpy as np

    np.sum(a == 1)
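When several values are looked up in the result of `value_counts` and some of them may be absent from the data, one possible approach is `Series.reindex` with `fill_value=0`. This is a minimal sketch; the list `wanted` is an assumption made up for the example.

```python
import pandas as pd

a = pd.Series([1, 1, 1, 1, 2, 2])
counts = a.value_counts()

# Hypothetical list of values to look up; 3 does not occur in `a`.
wanted = [1, 2, 3]

# reindex keeps the requested order and fills missing values with 0
selected = counts.reindex(wanted, fill_value=0)
result = {int(k): int(v) for k, v in selected.items()}
print(result)  # {1: 4, 2: 2, 3: 0}
```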

Source: https://habr.com/ru/post/1245257/

