Pandas: why is pandas.Series.std() different from numpy.std()?

Another update: resolved (see Comments and my own answer).

Update: this is what I am trying to explain.

 >>> pd.Series([7,20,22,22]).std()
 7.2284161474004804
 >>> np.std([7,20,22,22])
 6.2599920127744575

Answer: this is due to Bessel's correction, N-1 instead of N in the denominator of the standard deviation formula. I would like pandas to use the same convention as numpy.
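To make the two denominators concrete, here is a minimal sketch that computes both values by hand for the four prices above. The variable names are illustrative only; nothing here comes from pandas or numpy.

```python
import math

prices = [7, 20, 22, 22]          # the four prices from the question
n = len(prices)
mean = sum(prices) / n            # 17.75

# Sum of squared deviations from the mean (156.75 for this data).
ss = sum((p - mean) ** 2 for p in prices)

population_std = math.sqrt(ss / n)        # divide by N   (numpy's default)
sample_std = math.sqrt(ss / (n - 1))      # divide by N-1 (pandas' default)

print(population_std)   # approximately 6.26
print(sample_std)       # approximately 7.23
```

The only difference between the two results is the divisor, so for four values they differ by a factor of sqrt(4/3).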


There is a related discussion here, but the suggestions there do not work.

I have data on many different restaurants. Here is my data frame (imagine more than one restaurant, but the effect is reproduced with just one):

 >>> df
     restaurant_id  price
 id
 1           10407      7
 3           10407     20
 6           10407     22
 13          10407     22

Question: r.mi.groupby('restaurant_id')['price'].mean() returns the mean price for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.

As you can see, for the sake of simplicity I kept just one restaurant with four dishes. I want to find the standard deviation of the price. Just to make sure:

 >>> np.mean([7,20,22,22])
 17.75
 >>> np.std([7,20,22,22])
 6.2599920127744575

We can get the same (correct) values with

 >>> np.mean(df)
 restaurant_id    10407.00
 price               17.75
 dtype: float64
 >>> np.std(df)
 restaurant_id    0.000000
 price            6.259992
 dtype: float64

(Of course, pay no attention to the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I use groupby.

 >>> df.groupby('restaurant_id').agg('std')
                   price
 restaurant_id
 10407          7.228416

What?! 7.228416 is not 6.259992.

Let's try again.

 >>> df.groupby('restaurant_id').std() 

Same.

 >>> df.groupby('restaurant_id')['price'].std() 

Same.

 >>> df.groupby('restaurant_id').apply(lambda x: x.std()) 

Same.

However, this works:

 for id, group in df.groupby('restaurant_id'):
     print(id, np.std(group['price']))

Question: is there a proper way to aggregate the data, so that I get a new Series with the standard deviation for each restaurant?

2 answers

I see. Pandas applies Bessel's correction by default, that is, the standard deviation formula with N-1 instead of N in the denominator. As behzad.nouri noted in the comments,

 pd.Series([7,20,22,22]).std(ddof=0)==np.std([7,20,22,22]) 
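
Applied to the grouped data from the question, the same ddof=0 argument can be passed to the groupby std() call. The following is a minimal sketch that rebuilds the single-restaurant frame shown in the question; the name std_per_restaurant is only illustrative.

```python
import numpy as np
import pandas as pd

# The single-restaurant frame from the question.
df = pd.DataFrame(
    {"restaurant_id": [10407, 10407, 10407, 10407],
     "price": [7, 20, 22, 22]},
    index=pd.Index([1, 3, 6, 13], name="id"),
)

# ddof=0 divides by N, matching np.std's default.
std_per_restaurant = df.groupby("restaurant_id")["price"].std(ddof=0)
print(std_per_restaurant)

# Compare with a tolerance rather than ==, since the two libraries
# may round the last floating-point digit differently.
assert np.isclose(std_per_restaurant.loc[10407], np.std([7, 20, 22, 22]))
```

The result is a Series indexed by restaurant_id, with one standard deviation per restaurant.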

By default, the ddof ("delta degrees of freedom") argument is 0 in numpy and 1 in pandas. The divisor used is N - ddof, so ddof=1 applies Bessel's correction, as explained in the pandas documentation.
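
The conversion also works in the other direction: passing ddof=1 to np.std gives the value pandas reports by default. Here is a small sketch using the prices from the question, cross-checked against the standard library's statistics module (pstdev divides by N, stdev by N-1).

```python
import statistics
import numpy as np

prices = [7, 20, 22, 22]

numpy_default = np.std(prices)            # ddof=0, population formula
numpy_bessel = np.std(prices, ddof=1)     # ddof=1, same convention as pandas' default

# Cross-check both conventions against the standard library.
assert np.isclose(numpy_default, statistics.pstdev(prices))
assert np.isclose(numpy_bessel, statistics.stdev(prices))

print(numpy_default, numpy_bessel)
```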


Source: https://habr.com/ru/post/1201952/

