30,000 data points, find the biggest change in 2 weeks

I have:

- 30,000 data points
- each data point is a measurement of type float
- each measurement is associated with a date
- each date has only one measurement
- no dates are without measurements
- the data comes in the form of a text file: 30,000 lines in this form:
  - YYYY-MM-DD I,F (eg 1977-02-08 20.74)
- measurements appearing in the source file are already sorted by date

I need:

 - a time-interval T with boundaries (s,e) /* start, end */
 - (e - s = 14 days) the time-interval *must* be 2 weeks
 - define min as the lowest value in the interval T
 - define max as the greatest value in the interval T
 - the chosen T needs to have the greatest distance between max and min of all possible Ts
 - break ties among intervals T by choosing the most recent (with the greatest s value)
 - the chosen T must consider all jumps in the 14 days, not just the values at s and e
 - if the overall "variance" in the interval is great but the jump |max-min| is not the greatest in absolute value, T is not the right choice, even if it is an "exciting" interval

I'm asking:

 - which algorithm to employ, considering algorithms are not my specialty
 - which data structure to use to keep track of the subtotals

Note:

 - an answer in pseudo code would be preferred, "prose" is fine if pressured for time
 - an answer in Python would be... splendid :)

If you want, you can generate "dummy" data and run the proposed algorithm as a test, or I can share the actual data.

I'm not after performance as such, but I do want to know the fastest way to do this, in order to learn how to apply the right solution and the right algorithm.

I think I can "prove" the correctness even with the simplest iterative algorithm, because the data set is small, given today's computers.

So far I am "traversing and carrying around 14 vectors of 14 measurements". If you could teach me how to do this incrementally with the help of sub-sums, it would be really appreciated.

2 answers

A sliding window does work here, if you maintain two stacks (calling them stacks is perhaps a bit misleading, since this is probably best implemented as a double-ended queue). Keep one stack called minstack and another called maxstack. The crux of the algorithm is that, read from bottom to top, minstack must be non-decreasing and maxstack must be non-increasing at every point of the slide. So how do we do this?

First, add the first 14 points to the stacks. Define add(point) as:

Do this for minstack:

  • While the point is smaller than the top element of minstack, remove the top element of minstack.
  • Add the point to minstack.

Similarly, for maxstack:

  • While the new point is larger than the top element of maxstack, remove the top element of maxstack.
  • Add the point to maxstack.

Because of the property above, the min and max of the first 14 elements must be the bottom elements of minstack and maxstack. Now slide the window. We only have to note that if the left point is still "alive" in either stack, it must be the bottom element. Therefore sliding is simple:

    slide():
        add(new_point)
        if (left_point == bottom(minstack)) remove_bottom(minstack)
        if (left_point == bottom(maxstack)) remove_bottom(maxstack)

Do this until your points are exhausted. The interval you are looking for is the one in which bottom(maxstack) - bottom(minstack) was the largest.

Note that each point enters minstack and maxstack at most once, and leaves each of them at most once, so there are at most 4 stack operations per point, regardless of the size of the required interval.
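The bound of four operations per point can be checked empirically. The following is only a sketch: the function name `count_deque_operations` and the `window` parameter are made up for illustration, and it counts every push and pop on the two deques while running the same sliding scheme.

```python
from collections import deque


def count_deque_operations(points, window=14):
    """Run the two-deque sliding window and count pushes and pops."""
    minstack, maxstack = deque(), deque()
    ops = 0
    for index, x in enumerate(points):
        # Pop from the top while the new value breaks the ordering.
        while minstack and x < minstack[-1]:
            minstack.pop()
            ops += 1
        while maxstack and x > maxstack[-1]:
            maxstack.pop()
            ops += 1
        minstack.append(x)
        maxstack.append(x)
        ops += 2  # one push per deque
        # Drop the point that just left the window, if it is still present.
        if index >= window:
            left = points[index - window]
            if minstack[0] == left:
                minstack.popleft()
                ops += 1
            if maxstack[0] == left:
                maxstack.popleft()
                ops += 1
    return ops
```

For any input, the returned count should not exceed `4 * len(points)`, since every pop removes something that was pushed exactly once.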

EDIT: I just noticed that you would like an implementation in Python. I did not really feel like parsing the data, so the function takes a list of values as input and returns the indices (s, e) into that list:

    import collections

    def add(x, minstack, maxstack):
        # Pop from the top until the ordering invariant holds again.
        while minstack and x < minstack[-1]: minstack.pop()
        while maxstack and x > maxstack[-1]: maxstack.pop()
        minstack.append(x)
        maxstack.append(x)

    def get_largest_interval(points):
        minstack = collections.deque()
        maxstack = collections.deque()

        best_diff = -1
        best_interval = None

        for index, elem in enumerate(points):
            add(elem, minstack, maxstack)
            if index >= 14:
                # The point leaving the window can only sit at the bottom.
                if minstack[0] == points[index-14]: minstack.popleft()
                if maxstack[0] == points[index-14]: maxstack.popleft()

            if index >= 13:
                this_diff = maxstack[0] - minstack[0]
                # >= so that ties go to the most recent interval.
                if best_diff == -1 or this_diff >= best_diff:
                    best_interval = (index-13, index)
                    best_diff = this_diff

        return best_interval

    print(get_largest_interval([0, 2, 2,2,2,2,2,2,2,2,2,2,2,2,3]))
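Since the function above works on a plain list of values, the file still has to be parsed first. A minimal sketch of that step follows, assuming each line is a date and a dot-decimal float separated by whitespace, as in the question's example `1977-02-08 20.74`. The name `parse_measurements` is hypothetical, and the input is shown as an in-memory list of lines rather than a real file.

```python
from datetime import date


def parse_measurements(lines):
    """Split 'YYYY-MM-DD value' lines into parallel lists of dates and floats."""
    dates, values = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines (an assumption, not in the question)
        day_text, value_text = line.split()
        dates.append(date.fromisoformat(day_text))  # needs Python 3.7+
        values.append(float(value_text))
    return dates, values


# Example with two lines in the format given in the question.
dates, values = parse_measurements(["1977-02-08 20.74", "1977-02-09 21.10"])
```

The `values` list can then be passed to `get_largest_interval`, and the returned indices can be looked up in `dates` to report the interval as calendar dates.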

If I understand you, you have:

30,000 distinct, ordered data values. The ordering happens to be by date, but that is not relevant.

Within this set there are 29,987 subsets in which the content is an ordered sequence that starts at one data point and contains that starting point and the thirteen following data points.


Taking it very slowly:

1) Read your 30,000 data points into an array of size 30,000.

2) Allocate an array of size 29,987. Call this array Potential Winners.

3) Fill the Potential Winners array by scanning each 14-point subset, temporarily holding the max value and the min value encountered in the subset. When those two values are in hand, save (Max-Min) at the index location of the starting point within Potential Winners. Don't try any sliding-window optimization; see below.

4) Do a linear scan of Potential Winners, keeping track of the best value and (importantly) the index at which it is located. BTW: what do you do if there is no single winner? If all the data points have the same value, you will get 29,987 candidate winners, all with the same value.

5) Optimization: don't allocate and fill Potential Winners; initialize Current Winner as the tuple (value, index) = (0, -1). Compute the value of each 14-point subset as above, but retain only the better of {Current Winner, "the value I get from this current subset"}.
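Steps 1 to 5 above can be sketched in a few lines of Python. This is only an illustration: the function name `largest_spread_brute_force` and the `window` parameter are invented here, and the `>=` comparison is an assumption taken from the question's rule that ties go to the most recent interval.

```python
def largest_spread_brute_force(points, window=14):
    """Scan every window and keep the best (max - min, start index)."""
    best_value, best_start = 0, -1  # the (0, -1) "Current Winner" of step 5
    for start in range(len(points) - window + 1):
        chunk = points[start:start + window]
        value = max(chunk) - min(chunk)
        # >= lets a later window replace an earlier one on ties.
        if best_start == -1 or value >= best_value:
            best_value, best_start = value, start
    return best_value, best_start
```

With 30,000 points and 14-point windows this is roughly 420,000 comparisons, which is small enough to serve as a reference result when testing a faster method. If there are fewer points than `window`, the function returns the initial `(0, -1)`.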

6) Sliding windows: I have not thought this through, but I think that maintaining a sliding window is more work than the simple linear pass described above. Reason: OK, compute the value of the first 14 points; get the min and the max, and the spread between them. But wait, we need the min and max values for use in the next window. Now slide the window up one position. The value at the left end is gone; but was it the min, the max, or something in between? Suppose it was the min, and now it is gone. What is the second-lowest value? We don't have that information.

To maintain a sliding window, you would need to sort each 14-point subsequence and remember the index position of all the values. Then, when you slide, you can tell whether the value that dropped out on the left was the old min or the old max, and whether the new value coming in on the right is a new min or a new max. But it is not worth the effort.
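For readers who want to see what that bookkeeping might look like, here is one possible reading of the paragraph above: keep the current window's values in a sorted list, insert the incoming value, and delete the outgoing one. It is a sketch, not the answerer's code. The function name is hypothetical, it stores only values rather than index positions, and it relies on the standard `bisect` module.

```python
from bisect import bisect_left, insort


def largest_spread_sorted_window(points, window=14):
    """Slide a window while keeping its values in a sorted list."""
    sorted_window = []
    best_value, best_start = 0, -1
    for index, x in enumerate(points):
        insort(sorted_window, x)  # insert the value entering on the right
        if index >= window:
            # Remove one copy of the value that dropped out on the left.
            leaving = points[index - window]
            del sorted_window[bisect_left(sorted_window, leaving)]
        if index >= window - 1:
            value = sorted_window[-1] - sorted_window[0]  # max - min
            # >= keeps the most recent window on ties (assumption from the question).
            if best_start == -1 or value >= best_value:
                best_value, best_start = value, index - window + 1
    return best_value, best_start
```

Each slide costs an insertion and a deletion in a 14-element list, so whether this beats rescanning the window is mostly a matter of taste at this window size.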

(This situation is a bit reminiscent of the Boyer-Moore fast string-search algorithm. I don't remember the details, but it involves preprocessing the whole input and keeping a table of the places where each value occurs. But this is way off topic.)



Hope this helps ...


Source: https://habr.com/ru/post/918143/

