The right way to get a confidence interval with scipy

I have a 1-dimensional data array:

a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8]) 

for which I want to get a 68% confidence interval (i.e., 1 sigma).

The first comment on this answer states that this can be achieved with scipy.stats.norm.interval:

 from scipy import stats
 import numpy as np

 mean, sigma = np.mean(a), np.std(a)
 conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma)

But a comment on this post claims that the actual correct way to get the confidence interval is:

 conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a))) 

i.e., sigma is scaled by a factor of 1/np.sqrt(len(a)).

Question: which version is correct?

+17
python numpy scipy confidence-interval
Jan 30 '15 at 18:41
3 answers

The 68% confidence interval for a single draw from a normal distribution with mean mu and std sigma is

 stats.norm.interval(0.68, loc=mu, scale=sigma) 

The 68% confidence interval for the mean of N draws from a normal distribution with mean mu and std sigma is

 stats.norm.interval(0.68, loc=mu, scale=sigma/sqrt(N)) 

Intuitively, these formulas make sense: if you hold up a jar of jelly beans and ask a large number of people to guess the number of jelly beans, each individual guess may be off by a lot (some standard deviation sigma), but the mean of the guesses will do a remarkably good job of estimating the actual number, and this is reflected by the standard deviation of the mean shrinking by a factor of 1/sqrt(N).
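That shrinkage is easy to check with a quick simulation (a sketch; the jar total of 500 and the per-guess sigma of 10 are made-up numbers, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
true_count, sigma, N = 500, 10.0, 100  # made-up jar total, guess spread, number of guessers

# Simulate many rounds of N guesses each; the spread of the per-round means
# should shrink to sigma / sqrt(N).
guesses = rng.normal(loc=true_count, scale=sigma, size=(10_000, N))
means = guesses.mean(axis=1)

print(means.std())         # close to sigma / np.sqrt(N) = 1.0
print(sigma / np.sqrt(N))  # 1.0
```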




If one draw has a variance of sigma**2, then according to the Bienaymé formula, the sum of N uncorrelated draws has a variance of N*sigma**2.

The mean equals the sum divided by N. When you multiply a random variable (such as the sum) by a constant, the variance is multiplied by the square of the constant. That is,

 Var(cX) = c**2 * Var(X) 

Thus, the variance of the mean is

 (variance of the sum)/N**2 = N * sigma**2 / N**2 = sigma**2 / N 

and therefore the standard deviation of the mean (which is the square root of the variance) is

 sigma/sqrt(N). 

This is the origin of the sqrt(N) in the denominator.
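A quick numerical check of this derivation (a sketch; sigma = 2 and N = 25 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, N = 2.0, 25

# Many experiments of N draws each: the variance of the sums should be
# N * sigma**2, and the variance of the means should be sigma**2 / N.
draws = rng.normal(0, sigma, size=(20_000, N))
sums = draws.sum(axis=1)
means = sums / N

print(sums.var())   # close to N * sigma**2 = 100
print(means.var())  # close to sigma**2 / N = 0.16
```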




Here is some example code, based on Tom's code, that demonstrates the claims above:

 import numpy as np
 from scipy import stats

 N = 10000
 a = np.random.normal(0, 1, N)
 mean, sigma = a.mean(), a.std(ddof=1)
 conf_int_a = stats.norm.interval(0.68, loc=mean, scale=sigma)

 print('{:0.2%} of the single draws are in conf_int_a'
       .format(((a >= conf_int_a[0]) & (a < conf_int_a[1])).sum() / float(N)))

 M = 1000
 b = np.random.normal(0, 1, (N, M)).mean(axis=1)
 conf_int_b = stats.norm.interval(0.68, loc=0, scale=1 / np.sqrt(M))

 print('{:0.2%} of the means are in conf_int_b'
       .format(((b >= conf_int_b[0]) & (b < conf_int_b[1])).sum() / float(N)))

prints

 68.03% of the single draws are in conf_int_a
 67.78% of the means are in conf_int_b

Be careful: if you define conf_int_b using estimates of the mean and sigma based on the sample a, the mean may not fall in conf_int_b with the desired frequency.




If you take a sample from a distribution and compute the sample mean and std,

 mean, sigma = a.mean(), a.std() 

be careful to note that there is no guarantee that these equal the population mean and standard deviation, and that we are assuming the population is normally distributed - that is not automatically true!

If you take a sample and want to estimate the population mean and standard deviation, you should use

 mean, sigma = a.mean(), a.std(ddof=1) 

since this value for sigma is an unbiased estimate for the standard deviation of the population.
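The effect of ddof can be seen by averaging both variance estimates over many small samples (a sketch; the sample size n = 5 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
samples = rng.normal(0, 1, size=(100_000, n))  # population variance is 1

# ddof=0 underestimates the population variance by a factor of (n-1)/n on
# average; ddof=1 corrects for that.
print(samples.var(axis=1, ddof=0).mean())  # close to (n-1)/n = 0.8
print(samples.var(axis=1, ddof=1).mean())  # close to 1.0
```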

+38
Jan 30 '15 at 19:32

I just checked how R and GraphPad calculate confidence intervals, and they widen the interval for small sample sizes n - for example, by more than a factor of 6 for n = 2 compared with large n. This code (based on shasan's answer) matches their confidence intervals:

 import numpy as np, scipy.stats as st

 # returns the confidence interval of the mean
 def confIntMean(a, conf=0.95):
     mean, sem, m = np.mean(a), st.sem(a), st.t.ppf((1 + conf) / 2., len(a) - 1)
     return mean - m*sem, mean + m*sem

For R, I checked against t.test(a). GraphPad's "confidence interval of a mean" page has user-level information on the sample-size dependence.

Here's the output for Gabriel's example:

 In [2]: a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8])

 In [3]: confIntMean(a, 0.68)
 Out[3]: (3.9974214366806184, 4.877578563319382)

 In [4]: st.norm.interval(0.68, loc=np.mean(a), scale=st.sem(a))
 Out[4]: (4.0120010966037407, 4.8629989033962593)

Note that the difference between the confIntMean() and st.norm.interval() intervals is relatively small here, since len(a) == 16 is not too small.
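The widening for small n comes from the Student-t critical value replacing the normal one; the ratio of the two interval half-widths can be tabulated directly (a sketch):

```python
import scipy.stats as st

conf = 0.95
z = st.norm.ppf((1 + conf) / 2)  # normal critical value, independent of n

# Ratio of the t-based interval half-width to the normal-based one.
# For n = 2 the ratio is above 6; it approaches 1 as n grows.
for n in (2, 5, 30, 1000):
    t = st.t.ppf((1 + conf) / 2, n - 1)
    print(n, round(t / z, 2))
```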

+5
Oct 27 '15 at 17:28

I tested your methods using an array with a known confidence interval. numpy.random.normal(mu, std, size) returns an array centered at mu with a standard deviation of std (in the docs, this is defined as the "standard deviation (spread or 'width') of the distribution").

 from scipy import stats
 import numpy as np
 from numpy import random

 a = random.normal(0, 1, 10000)
 mean, sigma = np.mean(a), np.std(a)
 conf_int_a = stats.norm.interval(0.68, loc=mean, scale=sigma)
 conf_int_b = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))

 conf_int_a
 (-1.0011149125527312, 1.0059797764202412)
 conf_int_b
 (-0.0076030415111100983, 0.012467905378619625)

Since the interval should extend from about -1 to 1, the / np.sqrt(len(a)) method appears to be incorrect.

Edit

Since I don't have the reputation to comment above, I will explain how this answer relates to unutbu's thorough answer. If you populate a random array with a normal distribution, 68% of the values will fall within 1 sigma of the mean. In the case above, if you check, you see

 b = a[np.where((a > -1) & (a < 1))]
 len(b)
 6781

or about 68% of the population is within 1 sigma. Well, roughly 68%. As you use a larger and larger array, you will get closer to 68% (in a trial of 10, 9 were between -1 and 1). This is because 1 sigma is an integral over the data's distribution, and the more data you have, the better resolved it is.
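The convergence toward the theoretical 68.27% is easy to demonstrate with a large array (a sketch):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 1_000_000)

frac = np.mean(np.abs(a) < 1)                    # empirical fraction within 1 sigma
theory = stats.norm.cdf(1) - stats.norm.cdf(-1)  # about 0.6827
print(frac, theory)
```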

Basically, my interpretation of your question was: "If I have a sample of data that I want to use to describe the distribution it was drawn from, what is the correct method for finding the standard deviation of that data?" while unutbu's interpretation appears to be more: "What is the interval in which I can place the mean with 68% confidence?" That would mean, for the jelly beans, I answered "How are they guessing?" and unutbu answered "What do their guesses tell us about the jelly beans?"

+4
Jan 30 '15 at 19:29


