Pandas DataFrame: how to count the number of rows in each cluster of 1s in a binary column?

I have the following pandas DataFrame:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"first_column": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1,
                                        0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]})

    >>> df
        first_column
    0              0
    1              0
    2              0
    3              1
    4              1
    5              1
    6              0
    7              0
    8              1
    9              1
    10             0
    11             0
    12             0
    13             0
    14             1
    15             1
    16             1
    17             1
    18             1
    19             0
    20             0

first_column is a binary column of 0s and 1s. The 1s come in "clusters" of consecutive values, always in groups of at least two.

My goal is to create a column that "counts" the number of rows for each group:

    >>> df
        first_column  counts
    0              0       0
    1              0       0
    2              0       0
    3              1       3
    4              1       3
    5              1       3
    6              0       0
    7              0       0
    8              1       2
    9              1       2
    10             0       0
    11             0       0
    12             0       0
    13             0       0
    14             1       5
    15             1       5
    16             1       5
    17             1       5
    18             1       5
    19             0       0
    20             0       0

This sounds like a job for df.loc, something along the lines of df.loc[df.first_column == 1] ...

I'm just not sure how to take each individual "cluster" into account and how to tag each unique cluster with its row count.

How can I do this?

3 answers

Here's one approach with NumPy cumsum and bincount -

    def cumsum_bincount(a):
        # Prepend a 0 & look for [0,1] transitions to number each group of 1s
        ids = a * (np.diff(np.r_[0, a]) == 1).cumsum()
        # Get the bincount, index into the counts with ids, and finally mask out 0s
        return a * np.bincount(ids)[ids]

Sample run -

    In [88]: df['counts'] = cumsum_bincount(df.first_column.values)

    In [89]: df
    Out[89]:
        first_column  counts
    0              0       0
    1              0       0
    2              0       0
    3              1       3
    4              1       3
    5              1       3
    6              0       0
    7              0       0
    8              1       2
    9              1       2
    10             0       0
    11             0       0
    12             0       0
    13             0       0
    14             1       5
    15             1       5
    16             1       5
    17             1       5
    18             1       5
    19             0       0
    20             0       0
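To make the mechanics concrete, here is a quick sketch of the intermediate arrays on the sample data (added for illustration; the commented values are what each step produces for the df above):

    a = df.first_column.values

    # True at the start of every group of 1s (each 0 -> 1 transition)
    starts = np.diff(np.r_[0, a]) == 1

    # Running group number, zeroed outside the groups by multiplying with a
    ids = a * starts.cumsum()
    # array([0, 0, 0, 1, 1, 1, 0, 0, 2, 2, 0, 0, 0, 0, 3, 3, 3, 3, 3, 0, 0])

    # Per-id counts; id 0 collects all the zeros, which the final a* masks out again
    np.bincount(ids)
    # array([11,  3,  2,  5])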

Set the first 6 elements to 1s (the slice covers indices 0-4, and index 5 is already 1) and then check -

    In [101]: df.first_column.values[:5] = 1

    In [102]: df['counts'] = cumsum_bincount(df.first_column.values)

    In [103]: df
    Out[103]:
        first_column  counts
    0              1       6
    1              1       6
    2              1       6
    3              1       6
    4              1       6
    5              1       6
    6              0       0
    7              0       0
    8              1       2
    9              1       2
    10             0       0
    11             0       0
    12             0       0
    13             0       0
    14             1       5
    15             1       5
    16             1       5
    17             1       5
    18             1       5
    19             0       0
    20             0       0
  • Since first_column is binary, I can use astype(bool) to get True / False
  • If I take the complement and cumsum, the running total stays constant within each run of 1s, which gives me a convenient label for grouping the consecutive 1s (see the sketch after the code below)
  • Then I groupby and count with transform
  • transform broadcasts the count aggregation back onto the original index
  • First I use where to lump all the 0 rows into a single group
  • I use where again to set their counts back to 0
  • I use assign to produce a copy of df with the new column, because I don't want to clobber the df we already have. If you'd rather write to df directly, use df['counts'] = c

    t = df.first_column.astype(bool)
    c = df.first_column.groupby((~t).cumsum().where(t, -1)).transform('count').where(t, 0)
    df.assign(counts=c)

        first_column  counts
    0              0       0
    1              0       0
    2              0       0
    3              1       3
    4              1       3
    5              1       3
    6              0       0
    7              0       0
    8              1       2
    9              1       2
    10             0       0
    11             0       0
    12             0       0
    13             0       0
    14             1       5
    15             1       5
    16             1       5
    17             1       5
    18             1       5
    19             0       0
    20             0       0
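For reference, here is what that grouper actually looks like on the sample df (a sketch added for illustration; the commented arrays come from evaluating the expressions):

    t = df.first_column.astype(bool)

    (~t).cumsum().values
    # array([ 1,  2,  3,  3,  3,  3,  4,  5,  5,  5,  6,  7,  8,  9,  9,
    #         9,  9,  9,  9, 10, 11])

    (~t).cumsum().where(t, -1).values
    # array([-1, -1, -1,  3,  3,  3, -1, -1,  5,  5, -1, -1, -1, -1,  9,
    #         9,  9,  9,  9, -1, -1])

Each run of 1s keeps its own constant label (3, 5 and 9 here), while every 0 row collapses into the single -1 group, so transform('count') hands each 1-row the size of its run.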

Here is another approach with pandas groupby, which I find quite readable. A (possible) advantage is that it does not rely on the assumption that the column contains only 1s and 0s.

The main idea is to form groups of consecutive equal values and then simply compute their lengths. We also carry the value of each group along in the key, so we can filter out the zeros.

    # Relevant column -> the grouper needs to be 1-dimensional
    col_vals = df['first_column']

    # Group by runs of consecutive values, keyed by (run id, value in the run)
    grouped = df.groupby([(col_vals != col_vals.shift(1)).cumsum(), col_vals])

    # Length of each run where the value differs from zero, else zero
    df['counts'] = grouped['first_column'].transform(lambda group: len(group))\
                                          .where(col_vals != 0, 0)

Here's what the groups and keys look like:

    for key, group in grouped:
        print(key, group)

    (1, 0)    first_column
    0             0
    1             0
    2             0
    (2, 1)    first_column
    3             1
    4             1
    5             1
    (3, 0)    first_column
    6             0
    7             0
    (4, 1)    first_column
    8             1
    9             1
    (5, 0)     first_column
    10             0
    11             0
    12             0
    13             0
    (6, 1)     first_column
    14             1
    15             1
    16             1
    17             1
    18             1
    (7, 0)     first_column
    19             0
    20             0
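As a small variation (my addition, not part of the original answer): pandas has a built-in 'size' aggregation, so the Python-level lambda can be dropped. A sketch of the same computation, reusing col_vals from above:

    # Label each run of consecutive equal values, then broadcast the run sizes
    run_id = (col_vals != col_vals.shift(1)).cumsum()
    df['counts'] = df.groupby(run_id)['first_column'].transform('size')\
                     .where(col_vals != 0, 0)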

Source: https://habr.com/ru/post/1267147/

