Simple data processing

Say I have this dataset. After sorting, its distribution looks like the plot shown below.

M=[-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 19.6 20.75 24.25 34.5 49.5] 

plot for the data

My question is how to find the values that belong to the middle range and get their indices. Should I use a normal distribution or something else? I appreciate your help!

Image for Jonas

2 answers

Assuming your middle range is [-10 10], the indices are:

    > find(-10 < M & M < 10)
    ans =
        4    5    6    7    8    9   10   11   12   13   14   15   16   17   18

Note that you can also get the values themselves by logical indexing, for example:

    > M(-10 < M & M < 10)
    ans =
     Columns 1 through 15:
      -7.37500  -5.50000  -1.66667  -1.33333  and so on ...
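The two snippets above evaluate the same condition twice. As a minimal sketch (the variable names mask, idx and vals are only illustrative), the logical mask can be stored once and reused for both the indices and the values:

```matlab
M = [-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 ...
     0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 ...
     7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 19.6 ...
     20.75 24.25 34.5 49.5];

mask = (-10 < M) & (M < 10);   % logical vector, same size as M
idx  = find(mask);             % positions of the selected elements
vals = M(mask);                % the selected values themselves
```

Here idx holds the same indices as the find call above (4 through 18), and vals holds the corresponding 15 values.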

And to get the middle range (here, the values between the .25 and .75 quantiles), simply:

    > q = quantile(M(:), [.25 .75])
    q =
      -1.3214   17.0917
    > find(q(1) < M & M < q(2))
    ans =
        8    9   10   11   12   13   14   15   16   17   18   19   20

Note also that M(:) is used here to ensure that quantile treats M as a vector. If you adopt the convention that all vectors in your programs are column vectors, most functions will handle them correctly without extra work.
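To illustrate why the orientation matters, here is a small sketch with made-up data (Mrow, Mcol and A are hypothetical names, and median stands in for quantile so that no toolbox is needed). Functions of this kind work down the columns of a matrix, while (:) always produces a single column:

```matlab
Mrow = [-99 -44.5 -7.375 0.436363636 9.7 49.5];  % shortened row vector
Mcol = Mrow(:);                % (:) always yields a column vector
szRow = size(Mrow);            % [1 6]
szCol = size(Mcol);            % [6 1]

A = [1 2; 3 4; 5 6];           % a matrix is reduced column by column
perColumn = median(A);         % [3 4], one result per column
overall   = median(A(:));      % 3.5, all elements treated as one vector
```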

Update:
Now for a very short description of quantiles: they are points taken from the inverse of the cumulative distribution function (cdf) of a random variable. (Your sorted M can loosely be regarded as a sampled version of such a function, since it is non-decreasing.) Put simply, "the .5 quantile of your data" is the value below which 50% of the data lie. More detailed information on quantiles can be found, for example, here.
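As a rough sketch of that idea, the quantiles can be computed by hand with interp1. This assumes the convention that the i-th sorted value is assigned the probability (i - 0.5)/n with linear interpolation in between; for this data it reproduces the quantile output shown earlier, but treat it as an illustration rather than a replacement for quantile. The names Ms, p and fracBelow are only illustrative:

```matlab
M = [-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 ...
     0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 ...
     7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 19.6 ...
     20.75 24.25 34.5 49.5];

Ms = sort(M(:));                    % sorted sample as a column vector
n  = numel(Ms);
p  = ((1:n)' - 0.5) / n;            % probability assigned to each sorted value
q  = interp1(p, Ms, [0.25 0.75]);   % interpolate to get the .25 and .75 quantiles

fracBelow = mean(M < q(2));         % fraction of values below the .75 quantile
```

With the 27 values above, q comes out at about -1.3214 and 17.0917, and fracBelow is 20/27 (about 0.74), which is as close to 75% as a sample of this size allows.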


If you do not know a priori what your middle range is, but you know that you want to drop the outliers at the beginning and at the end of your curve, and if you have the Statistics Toolbox, you can run a robust linear regression on your data using ROBUSTFIT and keep only the inliers.

    M=[-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 19.6 20.75 24.25 34.5 49.5];

    %# robust linear regression
    x = find(isfinite(M)); %# eliminate NaN or Inf
    [u,s] = robustfit(x,M(x)); %# u: fitted coefficients, s: statistics structure

    %# inliers have a weight > 0.25 (raise this value to be stricter)
    inlierIdx = s.w > 0.25; %# s.w holds the robust weight of each point
    middleRangeX = x(inlierIdx)
    middleRangeValues = M(x(inlierIdx))

    %# plot with the regression in red and the good values in green
    plot(x,M(x),'-b.',x,u(1)+u(2)*x,'r')
    hold on,plot(middleRangeX,middleRangeValues,'*g')

the plot

    middleRangeX =
      Columns 1 through 21
         4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24
      Column 22
        25
    middleRangeValues =
      Columns 1 through 10
       -7.375   -5.5   -1.6667   -1.3333   -1.2857   0.43636   2.35   3.3   4.2857   5.0526
      Columns 11 through 20
        6.2   7.0769   7.2308   7.9167   9.7   10.667   16.167   17.4   19.2   19.6
      Columns 21 through 22
        20.75   24.25

Source: https://habr.com/ru/post/1338530/