Another update: resolved (see Comments and my own answer).
Update: this is the discrepancy I am trying to explain.

```
>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575
```
Answer: this is due to Bessel's correction, i.e. N-1 instead of N in the denominator of the standard deviation formula. I would like Pandas to use the same convention as numpy.
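Both libraries expose the denominator through the `ddof` ("delta degrees of freedom") parameter; they just default differently. A minimal sketch of reconciling them:

```python
import numpy as np
import pandas as pd

data = [7, 20, 22, 22]

# pandas defaults to the sample std (ddof=1, Bessel's correction);
# numpy defaults to the population std (ddof=0).
sample = pd.Series(data).std()   # ddof=1 -> 7.2284...
population = np.std(data)        # ddof=0 -> 6.2599...

# Passing ddof explicitly makes either library match the other:
assert np.isclose(pd.Series(data).std(ddof=0), np.std(data))
assert np.isclose(pd.Series(data).std(), np.std(data, ddof=1))
```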
There is a related discussion here, but the suggestions there do not work.
I have data on many different restaurants. Here is my data frame (imagine more than one restaurant; the effect is reproduced with just one):

```
>>> df
    restaurant_id  price
id
1           10407      7
3           10407     20
6           10407     22
13          10407     22
```
Question: `df.groupby('restaurant_id')['price'].mean()` returns the mean price for each restaurant. I want the standard deviations as well. However, `df.groupby('restaurant_id')['price'].std()` returns unexpected values.
As you can see, for the sake of simplicity I have only one restaurant with four dishes. I want to find the standard deviation of the price. Just to make sure:

```
>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575
```
We can get the same (correct) values with

```
>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64
```
(Of course, `np.std(df)` pays no attention to the restaurant ids; it just takes the std of every column.) Obviously, `np.std(df)` is not a solution if I have more than one restaurant. So I use `groupby`.

```
>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id
10407          7.228416
```
How come?! 7.228416 is not 6.259992.
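Both numbers come from the same sum of squared deviations; only the denominator differs. A quick hand computation, assuming nothing beyond the four prices above:

```python
import math

data = [7, 20, 22, 22]
mean = sum(data) / len(data)             # 17.75
ss = sum((x - mean) ** 2 for x in data)  # 156.75, sum of squared deviations

# pandas divides by N-1 (sample std), numpy by N (population std):
sample = math.sqrt(ss / (len(data) - 1))  # 7.2284...
population = math.sqrt(ss / len(data))    # 6.2599...
```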
Let's try again.
```
>>> df.groupby('restaurant_id').std()
```
Same.
```
>>> df.groupby('restaurant_id')['price'].std()
```
Same.
```
>>> df.groupby('restaurant_id').apply(lambda x: x.std())
```
Same.
However, this works:
```
for rid, group in df.groupby('restaurant_id'):
    print(rid, np.std(group['price']))
```
Question: what is the correct way to aggregate the data, so that I get a new Series with the standard deviation for each restaurant?
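Following the Bessel-correction answer above, the grouped `std` also accepts `ddof`, so passing `ddof=0` should give the numpy convention directly. A sketch on the example frame:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    {'restaurant_id': [10407, 10407, 10407, 10407],
     'price': [7, 20, 22, 22]},
    index=pd.Index([1, 3, 6, 13], name='id'),
)

# ddof=0 makes the grouped std match np.std (population std):
out = df.groupby('restaurant_id')['price'].std(ddof=0)
# out is a Series indexed by restaurant_id, value 6.2599... for 10407
```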