Python Pandas - random sampling of time series

Question


I'm new to Pandas and looking for the most efficient way to do this.

I have a Series of DataFrames. Each DataFrame has the same columns but a different index, and each is indexed by date. The Series itself is indexed by ticker symbol, so each element of the Series holds the time series of one stock's performance.

I need to randomly generate a list of n DataFrames, where each DataFrame is a subset of some random assortment of the available stock histories. Overlap is fine, as long as the start and end dates are different.

The following code does this, but it is very slow, and I am wondering if there is a better way to do it:


    def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
        if type(data) != pd.Series:
            return None

        if subset=='validate':
            offset = 0
        elif subset=='test':
            offset = 200
        elif subset=='train':
            offset = 400

        tickers = np.random.randint(0, len(data), size=len(data))

        ret_data = []
        while len(ret_data) != batch_size:
            for t in tickers:
                data_t = data[t]
                max_len = len(data_t)-timesteps-1
                if len(ret_data)==batch_size: break
                if max_len-offset < 0: continue

                index = np.random.randint(offset, max_len)
                d = data_t[index:index+timesteps]
                if len(d)==timesteps: ret_data.append(d)

        return ret_data

Profile Output:

    Timer unit: 1e-06 s

    File: finance.py
    Function: random_sample at line 137
    Total time: 0.016142 s

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       137                                           @profile
       138                                           def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
       139         1            5      5.0      0.0      if type(data) != pd.Series:
       140                                                   return None
       141
       142         1            1      1.0      0.0      if subset=='validate':
       143                                                   offset = 0
       144         1            1      1.0      0.0      elif subset=='test':
       145                                                   offset = 200
       146         1            0      0.0      0.0      elif subset=='train':
       147         1            1      1.0      0.0          offset = 400
       148
       149         1         1835   1835.0     11.4      tickers = np.random.randint(0, len(data), size=len(data))
       150
       151         1            2      2.0      0.0      ret_data = []
       152         2            3      1.5      0.0      while len(ret_data) != batch_size:
       153       116          148      1.3      0.9          for t in tickers:
       154       116         2497     21.5     15.5              data_t = data[t]
       155       116          317      2.7      2.0              max_len = len(data_t)-timesteps-1
       156       116           80      0.7      0.5              if len(ret_data)==batch_size: break
       157       115           69      0.6      0.4              if max_len-offset < 0: continue
       158
       159       100          101      1.0      0.6              index = np.random.randint(offset, max_len)
       160       100        10840    108.4     67.2              d = data_t[index:index+timesteps]
       161       100          241      2.4      1.5              if len(d)==timesteps: ret_data.append(d)
       162
       163         1            1      1.0      0.0      return ret_data


python pandas

Dave s Nov 05 '12 at 19:51


1 answer

Aman · Answer 1 · 2012-11-05T21:10:18+0000

Are you sure you need a faster method? Your current method is not that slow. The following changes might simplify the code, but will not necessarily make it faster:

Step 1: Take a random sample (with replacement) from the list of DataFrames

    rand_stocks = np.random.randint(0, len(data), size=batch_size)

You can think of the rand_stocks array as a list of indexes to apply to your Series of DataFrames. Its size already equals the batch size, which eliminates the need for the while loop and for your comparison on line 156.

That is, you can iterate over rand_stocks and access each stock as follows:

    for idx in rand_stocks:
        stock = data.iloc[idx]  # positional lookup into the Series of DataFrames
        # Get a sample from this stock.

Step 2: Get a random date range for each stock that you randomly selected.

    start_idx = np.random.randint(offset, len(stock)-timesteps)
    d = stock[start_idx:start_idx+timesteps]
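Step 2 can be wrapped in a small helper that also guards against stocks that are too short. This is only a sketch: the name `random_window` and the `rng` parameter are hypothetical and not part of the original code. It assumes each stock is a date-indexed Series or DataFrame. Because the upper bound of the random integer draw is exclusive, the sketch adds 1 so that the last full window can also be selected.

```python
import numpy as np
import pandas as pd


def random_window(stock, timesteps, offset=0, rng=None):
    """Return a random contiguous slice of `timesteps` rows, or None if too short.

    `random_window` and `rng` are illustrative names, not from the original post.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Exclusive upper bound for the start position; the last valid start
    # is len(stock) - timesteps.
    high = len(stock) - timesteps + 1
    if high <= offset:
        return None  # not enough rows after the offset for a full window
    start_idx = rng.integers(offset, high)
    # iloc slices by position, independent of the date index.
    return stock.iloc[start_idx:start_idx + timesteps]
```

Returning None for short stocks lets the caller decide whether to skip the stock or draw another one.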

I do not have your data, but here is how I would put it together:

    def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
        if subset=='train': offset = 0  #you can obviously change this back
        rand_stocks = np.random.randint(0, len(data), size=batch_size)
        ret_data = []
        for idx in rand_stocks:
            stock = data[idx]
            start_idx = np.random.randint(offset, len(stock)-timesteps)
            d = stock[start_idx:start_idx+timesteps]
            ret_data.append(d)
        return ret_data
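A possible variant keeps the three offsets from the question and only draws from stocks that are long enough, so the random draw can never fail on a short history. This is a sketch under assumptions: `data` is taken to be a plain list of Series or DataFrames (for a ticker-indexed Series, `data.iloc[i]` would replace `data[i]`), and the `OFFSETS` table and `seed` parameter are hypothetical additions.

```python
import numpy as np
import pandas as pd

# Offsets copied from the question's code; the dict itself is an illustrative choice.
OFFSETS = {'validate': 0, 'test': 200, 'train': 400}


def random_sample(data, timesteps=100, batch_size=100, subset='train', seed=None):
    """Draw `batch_size` random windows of `timesteps` rows from a list of stocks."""
    rng = np.random.default_rng(seed)
    offset = OFFSETS[subset]
    # Positions of stocks that can yield at least one full window after the offset.
    eligible = [i for i in range(len(data)) if len(data[i]) - timesteps >= offset]
    if not eligible:
        return []
    ret_data = []
    # Sample stock positions with replacement, one per batch element.
    for idx in rng.choice(eligible, size=batch_size):
        stock = data[idx]
        # +1 because the upper bound of integers() is exclusive.
        start_idx = rng.integers(offset, len(stock) - timesteps + 1)
        ret_data.append(stock.iloc[start_idx:start_idx + timesteps])
    return ret_data
```

Filtering once up front replaces the per-iteration length checks in the question's loop, and every returned window has exactly `timesteps` rows.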

Creating a dataset:

    In [22]: import numpy as np

    In [23]: import pandas as pd

    In [24]: rndrange = pd.DateRange('1/1/2012', periods=72, freq='H')

    In [25]: rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)

    In [26]: rndseries.head()
    Out[26]:
    2012-01-02    2.025795
    2012-01-03    1.731667
    2012-01-04    0.092725
    2012-01-05   -0.489804
    2012-01-06   -0.090041

    In [27]: data = [rndseries,rndseries,rndseries,rndseries,rndseries,rndseries]

Function Testing:

    In [42]: random_sample(data, timesteps=2, batch_size = 2)
    Out[42]:
    [2012-01-23    1.464576
    2012-01-24   -1.052048,
     2012-01-23    1.464576
    2012-01-24   -1.052048]
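The session above uses `pd.DateRange`, which comes from early pandas versions. Assuming a recent pandas release, a similar test dataset could be built with `pd.date_range`. The business-day frequency `'B'` is an assumption, chosen only because the printed index starts on Monday 2012-01-02 and advances one weekday at a time; the random values will differ from those shown.

```python
import numpy as np
import pandas as pd

# Business-day frequency is an assumption inferred from the printed dates;
# a Sunday start date rolls forward to the next business day.
rndrange = pd.date_range('2012-01-01', periods=72, freq='B')
rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)

# Six references to the same series, mirroring the test data in the answer.
data = [rndseries] * 6
```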
