Using describe() with weighted data: mean, standard deviation, median, quantiles

I am new to Python and pandas (coming from SAS as my workhorse analytics platform), so I apologize in advance if this has already been asked and answered. (I searched the documentation as well as this site and still could not find an answer.)

I have a dataframe (called resp) that contains respondent-level survey data. I want to compute some basic descriptive statistics for one of the fields (called anninc, short for annual income).

resp["anninc"].describe() 

Which gives me the basic statistics:

    count     76310.000000
    mean      43455.874862
    std       33154.848314
    min           0.000000
    25%       20140.000000
    50%       34980.000000
    75%       56710.000000
    max      152884.330000
    dtype: float64

But there is a catch. Given how this sample was constructed, the respondents' data had to be weighted so that each respondent is not counted as "equal" in the analysis. I have another column in the dataframe (tufnwgrp) that holds the weight to apply to each record during analysis.

In my previous SAS life, most procs had options for processing data with such weights. For example, a standard PROC UNIVARIATE step to get the same results would look something like this:

    proc univariate data=resp;
      var anninc;
      output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
    run;

And the same analysis using weighted data will look something like this:

    proc univariate data=resp;
      var anninc;
      weight tufnwgrp;
      output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
    run;

Is there a similar weighting option available in pandas for methods like describe(), etc.?

2 answers

There is a statistics and econometrics library (statsmodels) that seems to handle this. Here is an example that extends @MSeifert's answer to a similar question.

    import pandas as pd
    from statsmodels.stats.weightstats import DescrStatsW

    df = pd.DataFrame({'x': range(1, 101), 'wt': range(1, 101)})

    # ddof=1 requests the sample (n - 1) standard deviation
    wdf = DescrStatsW(df.x, weights=df.wt, ddof=1)

    print(wdf.mean)
    print(wdf.std)
    print(wdf.quantile([0.25, 0.50, 0.75]))

    67.0
    23.6877840059
    p
    0.25    50
    0.50    71
    0.75    87
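To see where these numbers come from, the same weighted mean and quartiles can be reproduced with plain NumPy. This is only a minimal sketch, not statsmodels' actual implementation: the helper names `weighted_mean` and `weighted_quantile` are made up for illustration, and the quantile rule used here (take the first sorted value whose cumulative weight reaches the requested fraction of the total weight) is just one of several conventions. It happens to agree with the output above for this data.

```python
import numpy as np

def weighted_mean(x, w):
    """Weighted mean: sum(w * x) / sum(w)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * x) / np.sum(w))

def weighted_quantile(x, w, probs):
    """Hypothetical helper: cumulative-weight quantiles without interpolation."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(x)                 # sort values, carry weights along
    x_sorted = x[order]
    cum_w = np.cumsum(w[order])           # running total of the weights
    targets = np.asarray(probs, dtype=float) * cum_w[-1]
    # first position whose cumulative weight is >= the target weight
    idx = np.searchsorted(cum_w, targets, side="left")
    idx = np.clip(idx, 0, len(x_sorted) - 1)
    return x_sorted[idx]

x = np.arange(1, 101)
w = np.arange(1, 101)

mean_x = weighted_mean(x, w)
quartiles = weighted_quantile(x, w, [0.25, 0.50, 0.75])

print(mean_x)     # 67.0
print(quartiles)  # [50. 71. 87.]
```

With real survey weights (non-integer, unsorted), other quantile definitions that interpolate between neighboring values may give slightly different results.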

I do not use SAS, but this gives the same answer as the Stata command:

 sum x [fw=wt], detail 

Actually, Stata has several weight options, and in this case it gives a slightly different answer if you specify aw (analytic weights) instead of fw (frequency weights). In addition, Stata requires fw to be integers, while DescrStatsW allows non-integer weights. Weights are harder than you might think... This starts to get into the weeds, but there is a lot of discussion about how the choice of weights affects the calculation of the standard deviation.
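The difference between the two kinds of weights comes down to the denominator of the variance. The sketch below compares two conventions on the same example data. The frequency-weight version treats each weight as a count of identical observations and reproduces the 23.6877840059 shown above. The analytic-weight version is an assumption on my part about a common convention (rescale the weights so they sum to the number of rows, then divide by n - 1); check your package's documentation before relying on it.

```python
import numpy as np

x = np.arange(1, 101, dtype=float)
w = np.arange(1, 101, dtype=float)

mean = np.sum(w * x) / np.sum(w)
ss = np.sum(w * (x - mean) ** 2)      # weighted sum of squared deviations

# Frequency weights: the effective sample size is sum(w),
# so the sample variance divides by sum(w) - 1.
fw_std = float(np.sqrt(ss / (np.sum(w) - 1)))

# Analytic weights (assumed convention): the sample size stays at the
# number of rows n, so the weighted variance is scaled by n / (n - 1).
n = len(x)
aw_std = float(np.sqrt((ss / np.sum(w)) * n / (n - 1)))

print(round(fw_std, 4))  # 23.6878
print(round(aw_std, 4))  # 23.8048
```

The mean is identical under both conventions; only the spread changes, and the gap shrinks as the total weight and the row count get closer to each other.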

Also note that DescrStatsW does not include functions for min and max, but as long as all your weights are nonzero this should not be a problem, since weights do not affect the min and max. However, if some of your weights are zero, it might be nice to have a weighted min and max, and these are easy to calculate in pandas:

    df.x[df.wt > 0].min()
    df.x[df.wt > 0].max()
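As a small illustration of why the filter matters, here is a sketch on a made-up four-row frame (the values are invented for this example) in which the smallest and largest observations both carry zero weight:

```python
import pandas as pd

# Hypothetical data: the extreme values 1 and 100 have zero weight
df = pd.DataFrame({"x": [1, 5, 9, 100], "wt": [0.0, 2.0, 3.0, 0.0]})

# Keep only rows that actually contribute to the weighted analysis
positive = df.loc[df["wt"] > 0, "x"]
weighted_min = positive.min()
weighted_max = positive.max()

print(df["x"].min(), df["x"].max())  # 1 100
print(weighted_min, weighted_max)    # 5 9
```

The `.loc` form is equivalent to the attribute-style indexing above; it is used here only because it avoids chained indexing.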

As @TomAugspuger says:

    In[29]: df = DataFrame(randn(5, 3), columns=list('abc'))
            column_of_interest = 'a'
            weights = Series(rand(len(df[column_of_interest])), name=column_of_interest)
            weights

    0    0.840
    1    0.486
    2    0.452
    3    0.316
    4    0.720
    Name: a, dtype: float64

    In[33]: weighted = weights * df[column_of_interest]
            weighted

    0   -1.400
    1   -0.163
    2    0.262
    3    0.274
    4   -1.163
    Name: a, dtype: float64

    In[34]: weighted.describe()

    count    5.000
    mean    -0.438
    std      0.794
    min     -1.400
    25%     -1.163
    50%     -0.163
    75%      0.262
    max      0.274
    dtype: float64

Source: https://habr.com/ru/post/949615/

