Split a Python sequence (time series / array) into overlapping subsequences

I need to extract all subsequences of a time series / array for a given window size. For instance:

    >>> ts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>> window = 3
    >>> subsequences(ts, window)
    array([[0, 1, 2],
           [1, 2, 3],
           [2, 3, 4],
           [3, 4, 5],
           [4, 5, 6],
           [5, 6, 7],
           [6, 7, 8],
           [7, 8, 9]])

Naive methods that iterate over a sequence are, of course, expensive, for example:

    import numpy as np
    import pandas as pd

    def subsequences(ts, window):
        res = []
        for i in range(ts.size - window + 1):
            # Take one window and discard the original index labels
            subts = ts.iloc[i:i + window].reset_index(drop=True)
            subts.name = None
            res.append(subts)
        return pd.DataFrame(res)

I found a better way: copy the sequence, shift each copy by a different offset until the whole window is covered, and split each shifted copy into windows with reshape. Performance is about 100 times better because the for loop iterates over the window size, not the size of the sequence:

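    def subsequences(ts, window):
        res = []
        # Keep only as many elements as fill whole windows
        n = ts.size - ts.size % window
        for i in range(window):
            # Shift by i, then cut the shifted copy into non-overlapping windows
            subts = ts.shift(-i).values[:n].reshape((ts.size // window, window))
            res.append(subts)
        return pd.DataFrame(np.concatenate(res, axis=0))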

I saw that pandas includes several rolling functions in the pandas.stats.moment module, and I assume that what they do is somehow similar to this subsequence problem. Is there anything in this module, or elsewhere in pandas, that would make this more efficient?

Thanks!

UPDATE (SOLUTION):

Based on @elyase's answer, there is a slightly simpler implementation for this particular case. Let me write it here and explain what it does:

    def subsequences(ts, window):
        shape = (ts.size - window + 1, window)
        strides = ts.strides * 2
        return np.lib.stride_tricks.as_strided(ts, shape=shape, strides=strides)

Given a 1-D numpy array, we first compute the shape of the resulting array. We will have a row starting at each position of the array, except for the last few elements, where too few elements remain to fill the window.

See the first example in this description: the last number we start at is 7, because starting at 8 we cannot create a window of three elements. Thus, the number of rows is the size minus the window plus one. The number of columns is simply the window size.
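The row count described above can be checked with a few lines of plain Python. This is only an illustrative sketch; the helper name `window_starts` is hypothetical and not part of the original post.

```python
def window_starts(n, window):
    """Valid start indices for windows of length `window` in a sequence of length n."""
    # There are n - window + 1 starting positions: 0 .. n - window
    return list(range(n - window + 1))


starts = window_starts(10, 3)
print(len(starts))   # 8 rows
print(starts[-1])    # last window starts at index 7
```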

Next comes the tricky part: describing how to fill the resulting array with the shape we just defined.

We know that the first element of the result will be the first element of the original array. Then we need to specify two values (as a tuple of two integers passed to the strides parameter). The values determine the steps we must take in the original (one-dimensional) array to fill the second (two-dimensional) one.

Consider another example where we want to implement the np.reshape function, turning a 1-D array of 9 elements into a 3x3 array. The first element fills the first position, and the one to its right is the next element of the 1-D array, so we move 1 step. Then, the difficult part: to fill the first element of the second row, we have to take 3 steps, from 0 to 3. See:

    >>> original = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
    >>> new = np.array([[0, 1, 2],
    ...                 [3, 4, 5],
    ...                 [6, 7, 8]])

So for reshape, the steps for the two dimensions are 1 (moving right within a row) and 3 (moving down to the next row); in NumPy's (row, column) order that is (3, 1). Our case, with overlapping, is actually easier. Moving right to fill the resulting array takes the next position in the 1-D array, and moving down to the next row also starts at the next element, so again 1 step in the 1-D array. So the steps will be (1, 1).

Only one last thing is left. The strides argument does not accept the "steps" we used, but the number of bytes in memory instead. To obtain them, we can use the strides attribute of numpy arrays. It returns a tuple of strides (steps in bytes), with one element for each dimension. In our case we get a 1-element tuple, and we want it twice, hence the * 2.
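To make the step-versus-byte distinction concrete, here is a small sketch that reproduces both the reshape example and the overlapping windows with `as_strided`. It assumes an `int64` array (8 bytes per element) and relies on NumPy's convention that the strides tuple is ordered (row, column).

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(9, dtype=np.int64)
step = a.strides[0]  # bytes per element: 8 for int64

# reshape to 3x3: 3 elements down to the next row, 1 element to the right
reshaped = as_strided(a, shape=(3, 3), strides=(3 * step, step))

# overlapping windows of size 3: 1 element down, 1 element to the right
window = 3
windows = as_strided(a, shape=(a.size - window + 1, window), strides=(step, step))

print(reshaped)
print(windows)
```

Since `a.strides` is the 1-tuple `(step,)`, writing `a.strides * 2` produces the same `(step, step)` tuple used for `windows`.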

The np.lib.stride_tricks.as_strided function performs filling using the described method without copying data, which makes it quite efficient.
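One way to confirm that no data is copied is `np.shares_memory`. The following is a minimal sketch, not part of the original post; the variable names are chosen for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(10)
window = 3

# Strided view: same buffer as `a`, just a different shape/strides description
view = as_strided(a, shape=(a.size - window + 1, window), strides=a.strides * 2)

# An explicit copy allocates its own buffer
independent = view.copy()

print(np.shares_memory(a, view))         # True
print(np.shares_memory(a, independent))  # False
```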

Finally, note that the function posted here assumes a 1-D input array (which is different from a two-dimensional array with a single row or column). Check the shape attribute of the input array: you should get something like (N,), not (N, 1). This method would fail on the latter. Note that the method posted by @elyase handles two-dimensional input arrays (which is why this version is slightly simpler).
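If the input might arrive as a column or row vector, one option is to normalise it before calling the function. The helper `as_1d` below is hypothetical (not from the original post) and is only a sketch of that idea.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def subsequences(ts, window):
    shape = (ts.size - window + 1, window)
    strides = ts.strides * 2
    return as_strided(ts, shape=shape, strides=strides)


def as_1d(values):
    """Hypothetical helper: return a 1-D array, flattening (N, 1) or (1, N) input."""
    arr = np.asarray(values)
    if arr.ndim == 2 and 1 in arr.shape:
        arr = arr.ravel()
    if arr.ndim != 1:
        raise ValueError(f"expected 1-D input, got shape {arr.shape}")
    return arr


col = np.arange(10).reshape(-1, 1)      # shape (10, 1)
print(col.shape, as_1d(col).shape)      # (10, 1) (10,)
print(subsequences(as_1d(col), 3).shape)
```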

+7
3 answers

This is 34 times faster than your fast version on my machine:

    def rolling_window(a, window):
        shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
        strides = a.strides + (a.strides[-1],)
        return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

    >>> rolling_window(ts.values, 3)
    array([[0, 1, 2],
           [1, 2, 3],
           [2, 3, 4],
           [3, 4, 5],
           [4, 5, 6],
           [5, 6, 7],
           [6, 7, 8],
           [7, 8, 9]])

Credit goes to Erik Rigtorp.

+9

It is worth noting that stride tricks can have unforeseen consequences when working with the converted array. The approach is efficient because it only changes how the memory is addressed, without creating a copy of the original array. If you update any values in the returned array, the values in the original array change too, and vice versa.

    l = np.asarray([1,2,3,4,5,6,7,8,9])
    _ = rolling_window(l, 3)
    print(_)
    # array([[1, 2, 3],
    #        [2, 3, 4],
    #        [3, 4, 5],
    #        [4, 5, 6],
    #        [5, 6, 7],
    #        [6, 7, 8],
    #        [7, 8, 9]])

    _[0,1] = 1000
    print(_)
    # array([[   1, 1000,    3],
    #        [1000,    3,    4],
    #        [   3,    4,    5],
    #        [   4,    5,    6],
    #        [   5,    6,    7],
    #        [   6,    7,    8],
    #        [   7,    8,    9]])

    # create new matrix from original array
    xx = pd.DataFrame(rolling_window(l, 3))
    # the updated values are still updated
    print(xx)
    #       0     1  2
    # 0     1  1000  3
    # 1  1000     3  4
    # 2     3     4  5
    # 3     4     5  6
    # 4     5     6  7
    # 5     6     7  8
    # 6     7     8  9

    # change values in xx changes values in _ and l
    xx.loc[0,1] = 100
    print(_)
    print(l)
    # [[  1 100   3]
    #  [100   3   4]
    #  [  3   4   5]
    #  [  4   5   6]
    #  [  5   6   7]
    #  [  6   7   8]
    #  [  7   8   9]]
    # [  1 100   3   4   5   6   7   8   9]

    # make a dataframe copy to avoid unintended side effects
    new = xx.copy()
    # changing values in new won't affect l, _, or xx

Any values that are changed in xx, _ or l show up in the other variables, because they all refer to the same memory.

See the NumPy docs for more details: numpy.lib.stride_tricks.as_strided
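As a hedged aside that is not part of the original answer: assuming NumPy 1.20 or later, `numpy.lib.stride_tricks.sliding_window_view` builds the same kind of windows but returns a read-only view by default, which guards against the accidental writes described above. A minimal sketch:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20

l = np.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9])
windows = sliding_window_view(l, 3)  # read-only view, shape (7, 3)

try:
    windows[0, 1] = 1000
except ValueError as exc:
    # Writing through the read-only view is rejected
    print("write rejected:", exc)

# Copy explicitly when a writable, independent array is needed
independent = windows.copy()
independent[0, 1] = 1000
print(l[1])  # the original array is untouched
```

`as_strided` also accepts `writeable=False` if you prefer to keep the original function and only make the result read-only.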

0

I would like to note that PyTorch offers a single function for this task, which is as memory-efficient as the best solution so far when working with Torch tensors, but much simpler and more general (e.g. when working with multiple dimensions):

    # Import packages
    import torch
    import pandas as pd

    # Create array and set window size
    ts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    window = 3

    # Create subsequences with converting to/from Tensor
    ts_torch = torch.from_numpy(ts.values)    # convert to torch Tensor
    ss_torch = ts_torch.unfold(0, window, 1)  # create subsequences in-memory
    ss_numpy = ss_torch.numpy()               # convert Tensor back to numpy (obviously now needs more memory)

    # Or just in a single line:
    ss_numpy = torch.from_numpy(ts.values).unfold(0, window, 1).numpy()

The key is the unfold function; see the PyTorch documentation for a detailed explanation. Converting back to numpy may not be necessary if you can work directly with PyTorch tensors, in which case the solution is just as memory-efficient. In my use case, I found it easier to first create the subsequences (and do other preprocessing) using Torch tensors, and to call .numpy() on those tensors to convert them to numpy as needed.

0

Source: https://habr.com/ru/post/980737/

