Pandas groupby: How to get string concatenation

Question

Pandas groupby: How to get string concatenation

I have a dataframe like this:

ABC 0 1 0.749065 This 1 2 0.301084 is 2 3 0.463468 a 3 4 0.643961 random 4 1 0.866521 string 5 2 0.120737 !

Call

 In [10]: print df.groupby("A")["B"].sum()

will return

 A 1 1.615586 2 0.421821 3 0.463468 4 0.643961

Now I would like to do the “same thing” for column “C”. Since this column contains rows, sum () does not work (although you might think that it will concatenate rows). I would like to see a list or set of lines for each group, i.e.

 A 1 {This, string} 2 {is, !} 3 {a} 4 {random}

I tried to find ways to do this.

Series.unique () ( http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html ) does not work though

 df.groupby("A")["B"]

is an

 pandas.core.groupby.SeriesGroupBy object

so I was hoping that any method of the series would work. Any ideas?

+108

python pandas

Anne Jul 24 '13 at 17:43

source share

7 answers

You can use the apply method to apply an arbitrary function to grouped data. Therefore, if you want to install a set, use set . If you need a list, use list .

 >>> d AB 0 1 This 1 2 is 2 3 a 3 4 random 4 1 string 5 2 ! >>> d.groupby('A')['B'].apply(list) A 1 [This, string] 2 [is, !] 3 [a] 4 [random] dtype: object

If you want something else, just write a function that does what you want, and then apply .

+60

BrenBarn Jul 24 '13 at 17:51

source share

You can use the aggregate (or agg ) function to combine the values. (Unverified code)

 df.groupby('A')['B'].agg(lambda col: ''.join(col))

+21

voithos Jul 24 '13 at 17:52

source share

You can try this:

 df.groupby('A').agg({'B':'sum','C':'-'.join})

+8

user3241146 Oct 15 '18 at 12:35

source share

a simple solution:

 >>> df.groupby(['A','B']).c.unique().reset_index()

+7

UserYmY Jul 29 '16 at 13:21

source share

If you want to overwrite column B in the data frame, this should work:

  df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))

+6

Amit Oct 26 '18 at 10:09

source share

Named aggregates with `pandas >= 0.25.0`

Starting from version 0.25.0 for pandas, we called aggregates in which we can group, aggregate and at the same time assign new names to our columns. Thus, we will not get MultiIndex columns, and the column names make more sense, given the data they contain:

aggregate and get a list of rows

 grp = df.groupby('A').agg(B_sum=('B','sum'), C=('C', list)).reset_index() print(grp) A B_sum C 0 1 1.615586 [This, string] 1 2 0.421821 [is, !] 2 3 0.463468 [a] 3 4 0.643961 [random]

aggregate and concatenate rows

 grp = df.groupby('A').agg(B_sum=('B','sum'), C=('C', ', '.join)).reset_index() print(grp) A B_sum C 0 1 1.615586 This, string 1 2 0.421821 is, ! 2 3 0.463468 a 3 4 0.643961 random

+5

Erfan Aug 23 '19 at 10:06 on

source share

Jeff · Accepted Answer · 2013-07-24 17:51

 In [4]: df = read_csv(StringIO(data),sep='\s+') In [5]: df Out[5]: ABC 0 1 0.749065 This 1 2 0.301084 is 2 3 0.463468 a 3 4 0.643961 random 4 1 0.866521 string 5 2 0.120737 ! In [6]: df.dtypes Out[6]: A int64 B float64 C object dtype: object

When you use your own function, non-numeric columns are not automatically excluded. This is slower than applying .sum() to groupby

 In [8]: df.groupby('A').apply(lambda x: x.sum()) Out[8]: ABC A 1 2 1.615586 Thisstring 2 4 0.421821 is! 3 3 0.463468 a 4 4 0.643961 random

sum merges by default

 In [9]: df.groupby('A')['C'].apply(lambda x: x.sum()) Out[9]: A 1 Thisstring 2 is! 3 a 4 random dtype: object

You can do basically what you want

 In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x)) Out[11]: A 1 {This, string} 2 {is, !} 3 {a} 4 {random} dtype: object

Do this for the whole frame, one group at a time. Key must return Series

 def f(x): return Series(dict(A = x['A'].sum(), B = x['B'].sum(), C = "{%s}" % ', '.join(x['C']))) In [14]: df.groupby('A').apply(f) Out[14]: ABC A 1 2 1.615586 {This, string} 2 4 0.421821 {is, !} 3 3 0.463468 {a} 4 4 0.643961 {random}

Pandas groupby: How to get string concatenation

Named aggregates with pandas >= 0.25.0

More articles:

Named aggregates with `pandas >= 0.25.0`