Calculate weighted average with pandas dataframe

Question

Calculate weighted average with pandas dataframe

I have the following pandas dataframe:

data_df = pd.DataFrame({'ind':['la','p','la','la','p','g','g','la'], 'dist':[10.,5.,7.,8.,7.,2.,5.,3.], 'diff':[0.54,3.2,8.6,7.2,2.1,1.,3.5,4.5], 'cas':[1.,2.,3.,4.,5.,6.,7.,8.]})

i.e

  cas diff dist ind 0 1 0.54 10 la 1 2 3.20 5 p 2 3 8.60 7 la 3 4 7.20 8 la 4 5 2.10 7 p 5 6 1.00 2 g 6 7 3.50 5 g 7 8 4.50 3 la

I need to calculate the weighted average of all the columns where the weights are in the "dist" column and group the values with "ind".

For example, for 'ind' = 'la' and the column 'diff':

 ((10*0.54)+(8.60*7)+(7.20*8)+(4.50*3))/(10+7+8+3) = 4.882143

The result I want is the following

  cas diff ind g 6.714286 2.785714 la 3.107143 4.882143 p 3.750000 2.558333

which is obtained by multiplying each value of each column by the corresponding value in the "dist" column, summarize the results with the same "ind", and then divide the result by the sum of all the "dist" values corresponding to the same index.

I thought it would be an easy task performed by the dataframe groupby method, but in fact it is quite complicated.

Can anyone help me out?

+5

python pandas weighted-average

Cecilia Nov 11 '15 at 18:55

source share

1 answer

behzad.nouri · Accepted Answer · 2015-11-11T19:04:21+0000

You can get within the normalized group weights using transform :

 >>> df['weight'] = df['dist'] / df.groupby('ind')['dist'].transform('sum') >>> df['weight'] 0 0.357143 1 0.416667 2 0.250000 3 0.285714 4 0.583333 5 0.285714 6 0.714286 7 0.107143 Name: weight, dtype: float64

Then you just need to multiply this weight by the values and take the sum:

 >>> df['wcas'], df['wdiff'] = (df[n] * df['weight'] for n in ('cas', 'diff')) >>> df.groupby('ind')[['wcas', 'wdiff']].sum() wcas wdiff ind g 6.714286 2.785714 la 3.107143 4.882143 p 3.750000 2.558333

Edit: using mutation in place:

 >>> backup = df.copy() # make a backup copy to mutate in place >>> cols = df.columns[:2] # cas, diff >>> df[cols] = df['weight'].values[:, None] * df[cols] >>> df.groupby('ind')[cols].sum() cas diff ind g 6.714286 2.785714 la 3.107143 4.882143 p 3.750000 2.558333

Calculate weighted average with pandas dataframe

More articles: