Pandas Multi-Column DataFrame Aggregation Function

Is there a way to write an aggregation function for use with the DataFrame.agg method that has access to more than one column of the data being aggregated? Typical use cases are weighted averages and weighted standard deviations.

I would like to write something like

    def wAvg(c, w):
        return ((c * w).sum() / w.sum())

    df = DataFrame(....)  # df has columns c and w; I want the weighted average
                          # of c, using w as the weights
    df.aggregate({"c": wAvg})  # and somehow tell it to use the w column as weights ...
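For the weighted standard deviation mentioned above, I would imagine the same pattern (a sketch only, with the same hypothetical c and w):

    import numpy as np

    def wStd(c, w):
        # weighted std: weighted mean of squared deviations from the weighted mean
        avg = (c * w).sum() / w.sum()
        return np.sqrt((w * (c - avg) ** 2).sum() / w.sum())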
python pandas

6 answers

Yes; use the .apply(...) function, which will be called on each sub-DataFrame. For example:

    grouped = df.groupby(keys)

    def wavg(group):
        d = group['data']
        w = group['weights']
        return (d * w).sum() / w.sum()

    grouped.apply(wavg)
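A quick sanity check on a toy frame (the grouping key name is made up; the data and weights columns match the snippet above):

    import pandas as pd

    df = pd.DataFrame({'category': ['x', 'x', 'y', 'y'],
                       'data': [1.0, 2.0, 3.0, 4.0],
                       'weights': [0.5, 1.5, 1.0, 1.0]})

    # group x: (1*0.5 + 2*1.5) / (0.5 + 1.5) = 1.75; group y: 3.5
    df.groupby('category').apply(wavg)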

The following (based on Wes McKinney's answer) does exactly what I was looking for. I would be happy to know if there is an easier way to do this in pandas.

    def wavg_func(datacol, weightscol):
        def wavg(group):
            dd = group[datacol]
            ww = group[weightscol] * 1.0
            return (dd * ww).sum() / ww.sum()
        return wavg

    def df_wavg(df, groupbycol, weightscol):
        grouped = df.groupby(groupbycol)
        df_ret = grouped.agg({weightscol: sum})
        datacols = [cc for cc in df.columns if cc not in [groupbycol, weightscol]]
        for dcol in datacols:
            try:
                wavg_f = wavg_func(dcol, weightscol)
                df_ret[dcol] = grouped.apply(wavg_f)
            except TypeError:  # handle non-numeric columns
                df_ret[dcol] = grouped.agg({dcol: min})
        return df_ret

The df_wavg() function returns a DataFrame grouped by the groupby column, with the sum of the weights in the weights column. The other columns are weighted averages or, if they are non-numeric, are aggregated with min().
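For instance, a hypothetical call (the column names here are made up for illustration):

    df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                       'val': [10.0, 20.0, 30.0, 40.0],
                       'weight': [1.0, 3.0, 1.0, 1.0]})

    # one row per group: the summed weights plus the weighted average of val;
    # grp 'a': weight = 4.0, val = (10*1 + 20*3) / 4 = 17.5
    df_wavg(df, 'grp', 'weight')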


I do this a lot and found the following convenient enough:

    def weighted_average(grp):
        return grp._get_numeric_data().multiply(grp['COUNT'], axis=0).sum() / grp['COUNT'].sum()

    df.groupby('SOME_COL').apply(weighted_average)

This will calculate the weighted average of all the numeric columns in df and drop the non-numeric ones.
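For instance (illustrative data, reusing the COUNT and SOME_COL names from the snippet):

    df = pd.DataFrame({'SOME_COL': ['a', 'a', 'b', 'b'],
                       'price': [1.0, 2.0, 3.0, 5.0],
                       'COUNT': [10, 30, 10, 10]})

    # every numeric column (including COUNT itself) is weighted by COUNT;
    # group 'a': price = (1*10 + 2*30) / 40 = 1.75
    df.groupby('SOME_COL').apply(weighted_average)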


Doing this with groupby(...).apply(...) can be very slow. Here's a solution I use all the time (essentially using kalu's logic).

    import pandas as pd

    def grouped_weighted_average(self, values, weights, *groupby_args, **groupby_kwargs):
        """
        :param values: column(s) to take the average of
        :param weights: column to weight on
        :param groupby_args: args to pass into groupby (e.g. the level you want to group on)
        :param groupby_kwargs: kwargs to pass into groupby
        :return: pandas.Series or pandas.DataFrame
        """
        if isinstance(values, str):
            values = [values]

        ss = []
        for value_col in values:
            df = self.copy()
            prod_name = 'prod_{v}_{w}'.format(v=value_col, w=weights)
            weights_name = 'weights_{w}'.format(w=weights)

            df[prod_name] = df[value_col] * df[weights]
            df[weights_name] = df[weights].where(~df[prod_name].isnull())
            df = df.groupby(*groupby_args, **groupby_kwargs).sum()

            s = df[prod_name] / df[weights_name]
            s.name = value_col
            ss.append(s)

        df = pd.concat(ss, axis=1) if len(ss) > 1 else ss[0]
        return df

    pd.DataFrame.grouped_weighted_average = grouped_weighted_average
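Usage, once the method is attached (the data here is illustrative):

    df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                       'd1': [5, 10, 100, 30],
                       'weights': [0.2, 0.8, 0.4, 0.6]})

    # weighted average of d1 by group: a -> 9.0, b -> 58.0
    df.grouped_weighted_average('d1', 'weights', 'group')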

My solution is similar to Nathaniel's, but it is for a single column, and it does not deep-copy the entire data frame each time, which can be prohibitively slow. The performance gain over the groupby(...).apply(...) solutions is about 100x (!)

    def weighted_average(df, data_col, weight_col, by_col):
        df['_data_times_weight'] = df[data_col] * df[weight_col]
        df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
        g = df.groupby(by_col)
        result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
        del df['_data_times_weight'], df['_weight_where_notnull']
        return result
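For example (illustrative data; note how a NaN in the data column drops that row's weight from the denominator as well):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'grp': ['a', 'a', 'a', 'b'],
                       'x': [1.0, 3.0, np.nan, 10.0],
                       'w': [1.0, 1.0, 5.0, 2.0]})

    # the NaN row contributes neither data nor weight: 'a' -> (1+3)/(1+1) = 2.0
    weighted_average(df, 'x', 'w', 'grp')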

You can return any number of aggregated values from a groupby object with apply. Simply return a Series, and the index values will become the new column names.

Take a look at a quick example:

    df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                       'd1': [5, 10, 100, 30],
                       'd2': [7, 1, 3, 20],
                       'weights': [.2, .8, .4, .6]},
                      columns=['group', 'd1', 'd2', 'weights'])
    df

      group   d1  d2  weights
    0     a    5   7      0.2
    1     a   10   1      0.8
    2     b  100   3      0.4
    3     b   30  20      0.6

Define a custom function to be passed to apply. It implicitly accepts a DataFrame - the data parameter it receives is a DataFrame. Notice how it uses multiple columns, which is not possible with the groupby agg method:

    import numpy as np

    def weighted_average(data):
        d = {}
        d['d1_wa'] = np.average(data['d1'], weights=data['weights'])
        d['d2_wa'] = np.average(data['d2'], weights=data['weights'])
        return pd.Series(d)

Call the groupby apply method using our custom function:

    df.groupby('group').apply(weighted_average)

           d1_wa  d2_wa
    group
    a        9.0    2.2
    b       58.0   13.2

You can get better performance by pre-calculating the weighted totals into new DataFrame columns, as described in the other answers, and not using apply at all.
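A sketch of that vectorized variant, reusing the df from this example (the helper column names d1w and d2w are made up):

    # pre-compute the products once, then do a single grouped sum with no apply
    tmp = df.assign(d1w=df['d1'] * df['weights'],
                    d2w=df['d2'] * df['weights'])
    sums = tmp.groupby('group')[['d1w', 'd2w', 'weights']].sum()

    # same numbers as the apply version: a -> 9.0, 2.2; b -> 58.0, 13.2
    pd.DataFrame({'d1_wa': sums['d1w'] / sums['weights'],
                  'd2_wa': sums['d2w'] / sums['weights']})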



