List each row for each group in a DataFrame

Question


In pandas, how can I add a new column that lists rows based on a given grouping?

For example, take the following DataFrame:

    import pandas as pd
    import numpy as np

    a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
    df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
    df

      col_a  col_b
    0     A      0
    1     B      1
    2     C      2
    3     A      3
    4     A      4
    5     C      5
    6     B      6
    7     B      7
    8     A      8
    9     C      9

I would like to add col_c, which gives the position (N-th row) of each row within its group, grouping by col_a and sorting by col_b.

Desired output:

      col_a  col_b  col_c
    0     A      0      1
    3     A      3      2
    4     A      4      3
    8     A      8      4
    1     B      1      1
    6     B      6      2
    7     B      7      3
    2     C      2      1
    5     C      5      2
    9     C      9      3

I am struggling to get col_c. I can get the correct grouping and sorting with .sort_index(by=['col_a', 'col_b']); now it is just a matter of adding the new column and numbering each row.

+6

python pandas

Greg reda Jun 21 '13 at 5:24


3 answers

The other answers both involve a Python function call for each group. If you have many groups, a vectorized approach should be faster (I have not tested this).

Here is my pure NumPy suggestion:

    In [5]: df.sort(['col_a', 'col_b'], inplace=True, ascending=(False, False))

    In [6]: sizes = df.groupby('col_a', sort=False).size().values

    In [7]: df['col_c'] = np.arange(sizes.sum()) - np.repeat(sizes.cumsum() - sizes, sizes)

    In [8]: print df
      col_a  col_b  col_c
    9     C      9      0
    5     C      5      1
    2     C      2      2
    7     B      7      0
    6     B      6      1
    1     B      1      2
    8     A      8      0
    4     A      4      1
    3     A      3      2
    0     A      0      3
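The session above uses `DataFrame.sort` and the Python 2 `print` statement, neither of which is available in current pandas and Python. A minimal sketch of the same idea for newer versions might look like this, assuming `sort_values` as the replacement for `sort`; the intermediate name `starts` is introduced here only for readability:

```python
import numpy as np
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# sort_values stands in for the older DataFrame.sort call.
df = df.sort_values(['col_a', 'col_b'], ascending=[False, False])

# Group sizes in order of appearance in the sorted frame (C, B, A).
sizes = df.groupby('col_a', sort=False).size().to_numpy()

# Position of each group's first row within the sorted frame.
starts = sizes.cumsum() - sizes

# Global row position minus the start of the row's own group.
df['col_c'] = np.arange(sizes.sum()) - np.repeat(starts, sizes)

print(df)
```

The subtraction works because the rows of each group are contiguous after sorting, so a row's position within its group is its global position minus the position where its group begins.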

+2

andrew Mar 2 '15 at 17:41


You can define your own function to handle this:

    In [58]: def func(x):
       ....:     x['col_c'] = x['col_a'].argsort() + 1
       ....:     return x
       ....:

    In [59]: df.groupby('col_a').apply(func)
    Out[59]:
      col_a  col_b  col_c
    0     A      0      1
    3     A      3      2
    4     A      4      3
    8     A      8      4
    1     B      1      1
    6     B      6      2
    7     B      7      3
    2     C      2      1
    5     C      5      2
    9     C      9      3
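As a possible alternative to a custom function (this is a different technique, not the one used in the answer above), the within-group position can also be derived from the rank of col_b inside each col_a group. A rough sketch, assuming col_b has no missing values:

```python
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# rank(method='first') numbers the rows of each group 1..n in ascending
# col_b order; ties are broken by order of appearance.
# The result is float, so it is cast to int here.
df['col_c'] = df.groupby('col_a')['col_b'].rank(method='first').astype(int)

# Sorting is only needed for display; the ranks do not depend on row order.
df = df.sort_values(['col_a', 'col_b'])
print(df)
```

Because the rank is computed from col_b directly, this does not require the frame to be sorted beforehand.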

+1

waitingkuo Jun 21 '13 at 9:02


Andy hayden · Accepted Answer · 2013-06-21T08:55:16+0000

There is now a cumcount method for exactly this case:

    g = df.groupby('col_a')
    df['col_c'] = g.cumcount()

As the docs say:

Number each item in each group from 0 to the length of that group - 1.

Original answer (before cumcount was available).
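Put together with the question's data, a short sketch of the cumcount approach could look as follows. It assumes a pandas version that provides `sort_values`, and adds 1 because the question numbers rows from 1 while cumcount starts at 0:

```python
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# Sort first so that cumcount follows col_b order within each group.
df = df.sort_values(['col_a', 'col_b'])

# cumcount is 0-based; add 1 to match the numbering asked for in the question.
df['col_c'] = df.groupby('col_a').cumcount() + 1
print(df)
```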

To do this, you can create a helper function:

    def add_col_c(x):
        x['col_c'] = np.arange(len(x))
        return x

First sort by the col_a column:

 In [11]: df.sort('col_a', inplace=True)

then apply this function to each group:

    In [12]: g = df.groupby('col_a', as_index=False)

    In [13]: g.apply(add_col_c)
    Out[13]:
      col_a  col_b  col_c
    3     A      3      0
    8     A      8      1
    0     A      0      2
    4     A      4      3
    6     B      6      0
    1     B      1      1
    7     B      7      2
    9     C      9      0
    2     C      2      1
    5     C      5      2

To get 1, 2, ..., you could use np.arange(1, len(x) + 1).
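The 1-based numbering can also be sketched with `transform` instead of a helper that modifies each group. This is a variation on the approach above rather than the original code, and it assumes a pandas version with `sort_values`:

```python
import numpy as np
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# Sort by both columns so the numbering follows col_b within each group.
df = df.sort_values(['col_a', 'col_b'])

# For each group, produce 1..len(group); transform aligns the result
# back to the original rows by index.
df['col_c'] = df.groupby('col_a')['col_b'].transform(
    lambda s: np.arange(1, len(s) + 1)
)
print(df)
```

Like the helper function, this still calls a Python function once per group, so it is mainly useful for illustration.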
