I need to extract all subsequences of a time series/array for a given window size. For instance:
    >>> ts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>> window = 3
    >>> subsequences(ts, window)
    array([[0, 1, 2],
           [1, 2, 3],
           [2, 3, 4],
           [3, 4, 5],
           [4, 5, 6],
           [5, 6, 7],
           [6, 7, 8],
           [7, 8, 9]])
Naive methods that iterate over a sequence are, of course, expensive, for example:
    def subsequences(ts, window):
        res = []
        for i in range(ts.size - window + 1):
            subts = ts[i:i+window]
            subts.reset_index(drop=True, inplace=True)
            subts.name = None
            res.append(subts)
        return pd.DataFrame(res)
I found a better way: copy the sequence, shift each copy by a different amount until the window is covered, and split each shifted copy into rows with reshape. Performance is around 100x better, because the for loop iterates over the window size rather than the sequence size:
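    def subsequences(ts, window):
        n_rows = ts.size // window
        res = []
        for i in range(window):
            # Shift left by i, keep only whole windows, then split into rows.
            # .values gives the NumPy array (Series has no reshape in current pandas);
            # slicing to n_rows * window also works when ts.size % window == 0.
            subts = ts.shift(-i).values[:n_rows * window].reshape((n_rows, window))
            res.append(subts)
        # Rows come out grouped by shift, and trailing windows may contain NaN.
        return pd.DataFrame(np.concatenate(res, axis=0))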
I saw that pandas includes several rolling functions in the pandas.stats.moment module, and I guess that what they do is somehow similar to this subsequence problem. Is there anything in that module, or elsewhere in pandas, that would make this more efficient?
Thanks!
UPDATE (SOLUTION):
Based on @elyase's answer, there is a slightly simpler implementation for this specific case. Let me write it down here and explain what it does:
    def subsequences(ts, window):
        shape = (ts.size - window + 1, window)
        strides = ts.strides * 2
        return np.lib.stride_tricks.as_strided(ts, shape=shape, strides=strides)
Given a 1-D numpy array, we first compute the shape of the resulting array. We will have one row starting at each position of the array, except for the last few positions, where there are not enough remaining elements to fill the window.
See the first example in this description: the last number we start at is 7, because starting at 8 we cannot create a window of three elements. Thus, the number of rows is the size minus the window plus one. The number of columns is simply the window.
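The row-count arithmetic can be checked directly on the example from the question. This is only a small sketch, assuming the same ten-element input and a window of 3; the names `n_rows` and `last_start` are introduced here for illustration.

```python
import numpy as np

ts = np.arange(10)   # same values as the example in the question
window = 3

# One row per valid start position: size - window + 1
n_rows = ts.size - window + 1
shape = (n_rows, window)

# Valid start positions are 0 .. n_rows - 1
last_start = n_rows - 1
last_window = ts[last_start:last_start + window]

print(shape)        # (8, 3)
print(last_window)  # [7 8 9]
```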
Next comes the tricky part: describing how to fill the resulting array, given the shape we just defined.
We take for granted that the first element of the result is the first element of the input. Then we need to specify two values (as a tuple of two integers passed to the strides parameter). These values determine the steps we must take in the original (1-D) array to fill the second (2-D) one.
Consider a different example, where we want to implement the np.reshape function, going from a 1-D array of 9 elements to a 3x3 array. The first element fills the first position, and the one to its right is the next element in the 1-D array, so we move 1 step. Then the tricky part: to fill the first element of the second row, we have to move 3 steps, from 0 to 3. See:
    >>> original = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
    >>> new = np.array([[0, 1, 2],
                        [3, 4, 5],
                        [6, 7, 8]])
So, for reshape, our steps would be 1 to move right and 3 to move down; in NumPy's (rows, columns) order that is (3, 1). For our case, where the windows overlap, it is actually simpler. When we move right to fill the resulting array, we take the next position in the 1-D array (1 step), and when we move down, we again start at the next element, so that is also 1 step in the 1-D array. So the steps will be (1, 1).
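To make the idea of "steps" concrete, here is a plain-loop sketch that fills a 2-D array from a 1-D one using element steps. The helper `fill_by_steps` is hypothetical (it is not part of NumPy), and it takes the steps in (step down, step right) order, counted in elements rather than bytes.

```python
import numpy as np

def fill_by_steps(flat, shape, steps):
    """Build a 2-D array where out[i, j] = flat[i * steps[0] + j * steps[1]].

    steps is (step down, step right), counted in elements of `flat`.
    """
    rows, cols = shape
    out = np.empty(shape, dtype=flat.dtype)
    for i in range(rows):
        for j in range(cols):
            out[i, j] = flat[i * steps[0] + j * steps[1]]
    return out

original = np.arange(9)

# Reshape case: 3 elements down, 1 element right
reshaped = fill_by_steps(original, (3, 3), (3, 1))

# Overlapping windows of size 3: 1 element down, 1 element right
windows = fill_by_steps(original, (original.size - 3 + 1, 3), (1, 1))

print(reshaped)
print(windows)
```

This copies data element by element, so it is only meant to show which input position each output cell reads from.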
There is only one last thing to note. The strides argument does not accept the "steps" we used above, but the corresponding number of bytes in memory. To find them, we can use the strides attribute of numpy arrays. It returns a tuple with the strides (the steps in bytes), with one element for each dimension. In our case we get a 1-element tuple, and we want it twice, hence the * 2.
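A quick way to see the byte strides is to inspect a few arrays. The sketch below assumes an explicit `int64` dtype so that each element takes 8 bytes regardless of platform.

```python
import numpy as np

a = np.arange(10, dtype=np.int64)

print(a.itemsize)     # 8 bytes per element
print(a.strides)      # (8,): one step in the 1-D array is 8 bytes
print(a.strides * 2)  # (8, 8): tuple repetition, not arithmetic

# For comparison, a real 3x3 reshape moves 3 elements (24 bytes) per row
b = np.arange(9, dtype=np.int64).reshape(3, 3)
print(b.strides)      # (24, 8)
```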
The np.lib.stride_tricks.as_strided function performs filling using the described method without copying data, which makes it quite efficient.
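The no-copy behaviour can be verified with `np.shares_memory`. The sketch below also compares the result with `sliding_window_view`, which is not mentioned in the original post; it is available in `numpy.lib.stride_tricks` from NumPy 1.20 onwards and returns a read-only view by default, so treat it as a possible alternative rather than the method described here.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

ts = np.arange(10)
window = 3

# Same construction as the function above, written inline
view = as_strided(ts, shape=(ts.size - window + 1, window), strides=ts.strides * 2)
print(np.shares_memory(ts, view))  # True: the windows reuse the input buffer

# Alternative (assumes NumPy >= 1.20): shape and strides are computed for us
safe = sliding_window_view(ts, window)
print(np.array_equal(view, safe))  # True
print(safe.flags.writeable)        # False
```

Because the rows of an `as_strided` result overlap in memory, writing to one cell changes every window that contains that element, so it is usually safer to treat the result as read-only or to copy it before modifying it.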
Finally, note that the function posted here assumes a 1-D input array (which is different from a 2-D array with a single row or column). Check the shape attribute of the input array: you should get something like (N,), not (N, 1). This method would fail on the latter. Note that the method posted by @elyase handles two-dimensional input arrays (which is why this version is slightly simpler).
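If the input might arrive as a column or row vector, one option is to flatten it before building the windows. The wrapper below, `subsequences_1d`, is a hypothetical name used for this sketch only; it assumes the input holds a single sequence and is not the approach from @elyase's answer.

```python
import numpy as np

def subsequences_1d(ts, window):
    """Flatten (N, 1) or (1, N) input to (N,) before windowing (illustrative)."""
    arr = np.ascontiguousarray(ts).ravel()
    if window < 1 or window > arr.size:
        raise ValueError("window must be between 1 and the number of elements")
    shape = (arr.size - window + 1, window)
    # arr is 1-D here, so arr.strides has one element and * 2 gives two
    return np.lib.stride_tricks.as_strided(arr, shape=shape, strides=arr.strides * 2)

column = np.arange(10).reshape(-1, 1)
print(column.shape)                      # (10, 1)
print(subsequences_1d(column, 3).shape)  # (8, 3)
```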