Pandas groupby: How to get string concatenation

I have a dataframe like this:

ABC 0 1 0.749065 This 1 2 0.301084 is 2 3 0.463468 a 3 4 0.643961 random 4 1 0.866521 string 5 2 0.120737 ! 

Call

 In [10]: print df.groupby("A")["B"].sum() 

will return

 A 1 1.615586 2 0.421821 3 0.463468 4 0.643961 

Now I would like to do the β€œsame thing” for column β€œC”. Since this column contains rows, sum () does not work (although you might think that it will concatenate rows). I would like to see a list or set of lines for each group, i.e.

 A 1 {This, string} 2 {is, !} 3 {a} 4 {random} 

I tried to find ways to do this.

Series.unique () ( http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html ) does not work though

 df.groupby("A")["B"] 

is an

 pandas.core.groupby.SeriesGroupBy object 

so I was hoping that any method of the series would work. Any ideas?

+108
python pandas
Jul 24 '13 at 17:43
source share
7 answers
 In [4]: df = read_csv(StringIO(data),sep='\s+') In [5]: df Out[5]: ABC 0 1 0.749065 This 1 2 0.301084 is 2 3 0.463468 a 3 4 0.643961 random 4 1 0.866521 string 5 2 0.120737 ! In [6]: df.dtypes Out[6]: A int64 B float64 C object dtype: object 

When you use your own function, non-numeric columns are not automatically excluded. This is slower than applying .sum() to groupby

 In [8]: df.groupby('A').apply(lambda x: x.sum()) Out[8]: ABC A 1 2 1.615586 Thisstring 2 4 0.421821 is! 3 3 0.463468 a 4 4 0.643961 random 

sum merges by default

 In [9]: df.groupby('A')['C'].apply(lambda x: x.sum()) Out[9]: A 1 Thisstring 2 is! 3 a 4 random dtype: object 

You can do basically what you want

 In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x)) Out[11]: A 1 {This, string} 2 {is, !} 3 {a} 4 {random} dtype: object 

Do this for the whole frame, one group at a time. Key must return Series

 def f(x): return Series(dict(A = x['A'].sum(), B = x['B'].sum(), C = "{%s}" % ', '.join(x['C']))) In [14]: df.groupby('A').apply(f) Out[14]: ABC A 1 2 1.615586 {This, string} 2 4 0.421821 {is, !} 3 3 0.463468 {a} 4 4 0.643961 {random} 
+161
Jul 24 '13 at 17:51
source share

You can use the apply method to apply an arbitrary function to grouped data. Therefore, if you want to install a set, use set . If you need a list, use list .

 >>> d AB 0 1 This 1 2 is 2 3 a 3 4 random 4 1 string 5 2 ! >>> d.groupby('A')['B'].apply(list) A 1 [This, string] 2 [is, !] 3 [a] 4 [random] dtype: object 

If you want something else, just write a function that does what you want, and then apply .

+60
Jul 24 '13 at 17:51
source share

You can use the aggregate (or agg ) function to combine the values. (Unverified code)

 df.groupby('A')['B'].agg(lambda col: ''.join(col)) 
+21
Jul 24 '13 at 17:52
source share

You can try this:

 df.groupby('A').agg({'B':'sum','C':'-'.join}) 
+8
Oct 15 '18 at 12:35
source share

a simple solution:

 >>> df.groupby(['A','B']).c.unique().reset_index() 
+7
Jul 29 '16 at 13:21
source share

If you want to overwrite column B in the data frame, this should work:

  df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x)) 
+6
Oct 26 '18 at 10:09
source share

Named aggregates with pandas >= 0.25.0

Starting from version 0.25.0 for pandas, we called aggregates in which we can group, aggregate and at the same time assign new names to our columns. Thus, we will not get MultiIndex columns, and the column names make more sense, given the data they contain:




aggregate and get a list of rows

 grp = df.groupby('A').agg(B_sum=('B','sum'), C=('C', list)).reset_index() print(grp) A B_sum C 0 1 1.615586 [This, string] 1 2 0.421821 [is, !] 2 3 0.463468 [a] 3 4 0.643961 [random] 



aggregate and concatenate rows

 grp = df.groupby('A').agg(B_sum=('B','sum'), C=('C', ', '.join)).reset_index() print(grp) A B_sum C 0 1 1.615586 This, string 1 2 0.421821 is, ! 2 3 0.463468 a 3 4 0.643961 random 
+5
Aug 23 '19 at 10:06 on
source share



All Articles