How to find values below (or above) the average

As you can see from the following summary report, the count for September (1542677) is far below the monthly average.

from io import StringIO  # Python 3 location of StringIO (the Python 2 StringIO module no longer exists)

myst="""01/01/2016  8781262
01/02/2016  8958598
01/03/2016  8787628
01/04/2016  9770861
01/05/2016  8409410
01/06/2016  8924784
01/07/2016  8597500
01/08/2016  6436862
01/09/2016  1542677
"""
u_cols = ['month', 'count']

import pandas as pd
# Split on any run of whitespace, so the sample parses whether the columns are separated by tabs or spaces.
df = pd.read_csv(StringIO(myst), sep=r'\s+', names=u_cols)

Is there a mathematical formula that can define this (ambiguous) concept of "too low or too high"?

This is easy if I define a limit (e.g. 9 or 10%). But I want the script to work it out for me and return a value only if the difference between the lowest and the second-lowest value is more than 5%. In this case, it should return the count for September.
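The rule described above can be sketched roughly as follows. This is only an illustration: the function name `lowest_if_gap` and the parameter `min_gap` are made up here, the gap is measured relative to the second-lowest value (one possible reading of "more than 5%"), and the month labels assume the dates are dd/mm/yyyy.

```python
import pandas as pd

# Counts copied from the question, labelled by month (assuming dd/mm/yyyy dates).
counts = pd.Series(
    [8781262, 8958598, 8787628, 9770861, 8409410,
     8924784, 8597500, 6436862, 1542677],
    index=["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"],
)

def lowest_if_gap(s, min_gap=0.05):
    """Return the lowest entry if it lies more than min_gap below the second lowest.

    min_gap is a fraction of the second-lowest value (hypothetical parameter).
    An empty Series is returned when the gap is not large enough.
    """
    two_lowest = s.nsmallest(2)  # sorted ascending: lowest first
    lowest, second = two_lowest.iloc[0], two_lowest.iloc[1]
    if (second - lowest) / second > min_gap:
        return two_lowest.iloc[:1]
    return s.iloc[:0]

flagged = lowest_if_gap(counts)
print(flagged)
```

On the sample data the September count is roughly 76% below the August count, so it is the one returned. The same idea could be mirrored with `nlargest` for values that are too high.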

+4
3 answers

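One option is a zscore, that is, the number of standard deviations each value lies from the mean. A common rule of thumb treats anything more than 2 standard deviations away as an outlier; for normally distributed data that is roughly the outer 5% of values.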

Define zscore:

import numpy as np

def zscore(s):
    # np.std defaults to the population standard deviation (ddof=0)
    return (s - np.mean(s)) / np.std(s)

Apply it to the count column:

zscore(df['count'])

0    0.414005
1    0.488906
2    0.416694
3    0.831981
4    0.256946
5    0.474624
6    0.336390
7   -0.576197
8   -2.643349
Name: count, dtype: float64

The last value, September, is about 2.6 standard deviations below the mean.

Use abs and gt to flag the values that are more than 2 standard deviations away:

zscore(df['count']).abs().gt(2)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8     True
Name: count, dtype: bool

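This mask can be used to filter the DataFrame, either to select the outlier (gt) or to keep everything else (le):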

df[zscore(df['count']).abs().gt(2)]

[image: the outlier row, index 8]

df[zscore(df['count']).abs().le(2)]

[image: the remaining rows, index 0 to 7]
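For reference, the steps of this answer can be put together in one self-contained sketch. It is not the answerer's exact code: the `ddof` parameter is an addition made here to show that pandas' own `Series.std` defaults to the sample standard deviation (`ddof=1`), while `np.std` uses the population one (`ddof=0`), so the z-scores differ slightly depending on which is used.

```python
import pandas as pd

# Counts copied from the question, in the original row order (index 0..8).
counts = pd.Series(
    [8781262, 8958598, 8787628, 9770861, 8409410,
     8924784, 8597500, 6436862, 1542677],
    name="count",
)

def zscore(s, ddof=0):
    """Standard deviations from the mean; ddof=0 matches np.std's default."""
    return (s - s.mean()) / s.std(ddof=ddof)

z_pop = zscore(counts)             # population std, as in the answer above
z_sample = zscore(counts, ddof=1)  # sample std, pandas' own default

# Keep only the values more than 2 standard deviations from the mean.
outliers = counts[z_pop.abs() > 2]
print(outliers)
```

With either choice of `ddof`, only the last row (September) exceeds the threshold of 2 on this data; the sample version just gives a slightly smaller magnitude.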

+3

[Most of this answer's text did not survive extraction. The remaining fragments indicate that it treats the question as a case of outlier detection, points to Python-related posts on the topic, and mentions Andrew Ng in connection with ML.]

+2

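Another option is the interquartile range (IQR, see wikipedia), which is the difference between the 75% quantile (Q3) and the 25% quantile (Q1).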

A value is treated as an outlier if it falls below Q1 - k * IQR or above Q3 + k * IQR.

You can choose the constant k based on your domain knowledge (a common choice is 1.5).

Given the data, a filter in pandas might look like this:

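# Q1 and Q3 of the counts, laid out as a single row with columns 0.25 and 0.75
iqr_filter = pd.DataFrame(df["count"].quantile([0.25, 0.75])).T
iqr_filter["iqr"] = iqr_filter[0.75] - iqr_filter[0.25]
iqr_filter["lo"] = iqr_filter[0.25] - 1.5 * iqr_filter["iqr"]
iqr_filter["up"] = iqr_filter[0.75] + 1.5 * iqr_filter["iqr"]
# .iloc[0] reads the single row by position (its index label is "count", not 0)
df_filtered = df.loc[(df["count"] > iqr_filter["lo"].iloc[0]) & (df["count"] < iqr_filter["up"].iloc[0]), :]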
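The same rule can also be written with k as a parameter, which makes it easier to try different thresholds. This is a minimal sketch; the helper name `iqr_bounds` is made up for illustration, and the quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Counts copied from the question, in the original row order (index 0..8).
counts = pd.Series(
    [8781262, 8958598, 8787628, 9770861, 8409410,
     8924784, 8597500, 6436862, 1542677],
    name="count",
)

def iqr_bounds(s, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lo, up = iqr_bounds(counts)
outliers = counts[(counts < lo) | (counts > up)]   # values outside the fences
inliers = counts[(counts >= lo) & (counts <= up)]  # values kept by the filter
print(outliers)
```

Note that on this data k = 1.5 is stricter than the z-score rule: it flags rows 3, 7 and 8 (01/04, 01/08 and 01/09), not only the September count. A larger k widens the fences, so k is worth tuning to the data.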
+2

Source: https://habr.com/ru/post/1657237/

