Enumerate each row for each group in a DataFrame

In pandas, how can I add a new column that enumerates the rows within each group of a given grouping?

For instance, assume the following DataFrame:

    import pandas as pd
    import numpy as np

    a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
    df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
    df

      col_a  col_b
    0     A      0
    1     B      1
    2     C      2
    3     A      3
    4     A      4
    5     C      5
    6     B      6
    7     B      7
    8     A      8
    9     C      9

I would like to add col_c, which gives the position (1st, 2nd, ...) of each row within its group, grouping by col_a and sorting by col_b.

Desired output:

      col_a  col_b  col_c
    0     A      0      1
    3     A      3      2
    4     A      4      3
    8     A      8      4
    1     B      1      1
    6     B      6      2
    7     B      7      3
    2     C      2      1
    5     C      5      2
    9     C      9      3

I am struggling to get col_c. The right grouping and sorting can be obtained with .sort_index(by=['col_a', 'col_b']); what remains is to build the new column that labels each row.

+6
3 answers

There's cumcount, for precisely this case:

    df['col_c'] = df.groupby('col_a').cumcount()

As the docs say:

Number each item in each group from 0 to the length of that group - 1.
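
To connect cumcount with the output requested in the question, here is a minimal sketch. It assumes a recent pandas where sort_values is available, and it adds 1 because cumcount starts at 0 while the question wants labels starting at 1.

```python
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# Order rows by group, then by col_b inside each group.
df = df.sort_values(['col_a', 'col_b'])

# cumcount numbers rows 0..n-1 within each group; add 1 for 1-based labels.
df['col_c'] = df.groupby('col_a').cumcount() + 1
print(df)
```

Sorting first matters here: cumcount numbers rows in the order in which they appear in the frame, so the sort is what ties the labels to col_b.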


Original answer (before cumcount was defined).

You could create a helper function to do this:

    def add_col_c(x):
        # Number the rows of one group 0..len(x)-1.
        x['col_c'] = np.arange(len(x))
        return x

First sort by the col_a column:

 In [11]: df.sort('col_a', inplace=True) 

then apply this function to each group:

    In [12]: g = df.groupby('col_a', as_index=False)

    In [13]: g.apply(add_col_c)
    Out[13]:
      col_a  col_b  col_c
    3     A      3      0
    8     A      8      1
    0     A      0      2
    4     A      4      3
    6     B      6      0
    1     B      1      1
    7     B      7      2
    9     C      9      0
    2     C      2      1
    5     C      5      2

To get 1, 2, ..., you could use np.arange(1, len(x) + 1).
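
The same per-group numbering can be sketched against a recent pandas, where df.sort is no longer available. This is an illustration rather than the answer's own code: number_rows is a hypothetical helper name, and transform is used in place of apply so that the helper does not modify the group it receives.

```python
import numpy as np
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# Sort by group and by col_b so the numbering follows col_b order.
df = df.sort_values(['col_a', 'col_b'])

def number_rows(s):
    # Hypothetical helper: return 1..len(s) for one group's col_b values.
    return np.arange(1, len(s) + 1)

# transform calls number_rows once per group and aligns the result to df's index.
df['col_c'] = df.groupby('col_a')['col_b'].transform(number_rows)
print(df)
```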

+13

The given answers involve a call to a Python function for each group, so if you have many groups a vectorized approach should be faster (I have not checked).

Here is my pure NumPy suggestion:

    In [5]: df.sort(['col_a', 'col_b'], inplace=True, ascending=(False, False))

    In [6]: sizes = df.groupby('col_a', sort=False).size().values

    In [7]: df['col_c'] = np.arange(sizes.sum()) - np.repeat(sizes.cumsum() - sizes, sizes)

    In [8]: print df
      col_a  col_b  col_c
    9     C      9      0
    5     C      5      1
    2     C      2      2
    7     B      7      0
    6     B      6      1
    1     B      1      2
    8     A      8      0
    4     A      4      1
    3     A      3      2
    0     A      0      3
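
The arithmetic in that one-liner is easier to follow when split into steps. The sketch below assumes a recent pandas (sort_values, to_numpy) and sorts ascending with 1-based labels to match the question; the name starts is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df = df.sort_values(['col_a', 'col_b'])

# Group sizes in the order the groups appear in the sorted frame: [4, 3, 3].
sizes = df.groupby('col_a', sort=False).size().to_numpy()

# Position of each group's first row, repeated once per row of that group:
# [0, 0, 0, 0, 4, 4, 4, 7, 7, 7].
starts = np.repeat(sizes.cumsum() - sizes, sizes)

# Global row position minus the group's start gives the 0-based rank; add 1.
df['col_c'] = np.arange(len(df)) - starts + 1
print(df)
```

This only works when the rows of each group are contiguous, which is why the frame is sorted before the sizes are computed.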
+2

You can define your own function to handle this:

    In [58]: def func(x):
       ....:     x['col_c'] = x['col_a'].argsort() + 1
       ....:     return x
       ....:

    In [59]: df.groupby('col_a').apply(func)
    Out[59]:
      col_a  col_b  col_c
    0     A      0      1
    3     A      3      2
    4     A      4      3
    8     A      8      4
    1     B      1      1
    6     B      6      2
    7     B      7      3
    2     C      2      1
    5     C      5      2
    9     C      9      3
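
Since col_a is constant within a group, the helper above cannot take col_b into account when numbering the rows. As an alternative sketch, not taken from this answer, the built-in groupby rank can order rows by col_b directly, without sorting the frame first.

```python
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# Rank col_b inside each col_a group; method='first' breaks ties by row order.
# rank returns floats, so cast to int for clean 1-based labels.
df['col_c'] = df.groupby('col_a')['col_b'].rank(method='first').astype(int)

# Sort only for display, to match the layout used in the question.
print(df.sort_values(['col_a', 'col_b']))
```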
+1

Source: https://habr.com/ru/post/947805/

