Algorithm: is there a linear trend in the data?

I have continuously input data represented by an array of integer x = [x1, ..., xn], n <1 000 000. Each two elements satisfy the following condition x [i] x [i + 1].

I need to find a breakpoint as quickly as possible, where the linear trend of this data ends and turns into a quadratic trend. Data always starts with a linear trend ...

I tried to calculate

k = (x[i+1] - x[i])/ (x[i] - x[i-1]) 

but this test is not very reliable ... Maybe there is a simpler and more efficient statistical test ... In this case, the calculation of the regression line is slower ...

Thank you for your help...

+2
source share
4 answers

In fact, you compute the derivative of the function. Perhaps you should use more points to calculate it, for example. 5, see Five-point stencil

0
source

Keep track of the first derivation and second derivation. That is, keep the mean and variance x [i] -x [i-1]. And save the sum and variance (x [i + 1] -x [i]) - (x [i] -x [i-1]).

For a linear trend, the average value of the first derivative should be constant, and if you notice a deviation from the average (which you can calculate using the variance), then you can say that something is wrong. The average value of the second derivative should be 0.

For a quadratic trend, the average value of the first derivative increases. Thus, you will find many samples with a large deviation from the average. The second derivative behavior is similar to the behavior of the first derivative in the linear case.

Algorithm (using only the second derivative):

  • For each input, calculate the sign (+ ve or -ve) of the second derivative
  • Keep track of how many recently appeared homogeneous characters (i.e. if the sequence is + - ++++ answer 4)
  • If the length of the homogeneous signs is greater than the threshold (say, 40?), Then mark it as the beginning of a quadratic sequence
+1
source

Here you can use the regression of the current window.

The calculation of linear regression coefficients for points W includes sums of terms of the form X [i], iX [i] and X [i] ^ 2. If you store these sums, you can easily move one point by deriving the terms for the leftmost point and adding terms for the rightmost point (iX [i] becomes (i + 1) .X [i], ieiX [i] + X [I]). Your data values ​​are integer, there will be no rounding accumulation.

This means that you can calculate the current regression in constant time for each consecutive point W and determine the fall of the correlation coefficient.

0
source

For an ultrafast solution, you can consider a test such as:

 | X[i + s] - 2 X[i] + X[i - s] | > k (X[i + s] - X[i - s]) 

for well-chosen s and k.

Look at the plot | X [i + s] - 2 X [i] + X [i - s] | / (X [i + s] - X [i - s]) as a function of i for increasing s.

0
source

Source: https://habr.com/ru/post/949244/


All Articles