I would use HDF5 / pytables as follows:
- Keep the data as a plain Python list "as long as possible".
- Append your results to this list.
- When it gets "big":
  - push it to the HDF5 store using pandas io (and an appendable table),
  - clear the list.
- Repeat.
In fact, the function defined below keeps a separate list for each "key", so you can store multiple DataFrames in the HDF5 store in the same process.
We define a function that you call with each row d:
```python
import pandas as pd

CACHE = {}
STORE = 'store.h5'   # Note: another option is to keep the actual file open

def process_row(d, key, max_len=5000, _cache=CACHE):
    """
    Append row d to the store 'key'.

    When the number of items in the key's cache reaches max_len,
    append the list of rows to the HDF5 store and clear the list.
    """
    # keep the rows for each key separate
    lst = _cache.setdefault(key, [])
    if len(lst) >= max_len:
        store_and_clear(lst, key)
    lst.append(d)

def store_and_clear(lst, key):
    """
    Convert the key's cache list to a DataFrame and append it to HDF5.
    """
    df = pd.DataFrame(lst)
    with pd.HDFStore(STORE) as store:
        store.append(key, df)
    lst.clear()
```
Note: we use the with statement to automatically close the store after each write. It may be faster to keep it open, but if you do, it is recommended that you flush it regularly (closing flushes); see the sketch below. Also note that it might be more readable to use a collections.deque rather than a list, but list performance will be slightly better here.
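As a rough sketch of that keep-it-open alternative (not part of the main code above; the name store_and_clear_open is purely illustrative), you could open the store once and flush explicitly after each batch:

```python
import pandas as pd

STORE = 'store.h5'
store = pd.HDFStore(STORE)      # opened once and kept open for the whole run

def store_and_clear_open(lst, key):
    """Like store_and_clear, but reuses the already-open store."""
    store.append(key, pd.DataFrame(lst))
    store.flush(fsync=True)     # flush regularly, since we never close between batches
    lst.clear()

# ... process rows as before, then when the job is done:
# store.close()
```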
To use this, you call:
```python
process_row({'time': '2013-01-01 00:00:00', 'stock': 'BLAH',
             'high': 4.0, 'low': 3.0, 'open': 2.0, 'close': 1.0},
            key="df")
```
Note: "df" is the saved key used in the pytables repository.
After completing the job, make sure you store_and_clear rest of the cache:
```python
for k, lst in CACHE.items():
    store_and_clear(lst, k)
```
Your full DataFrame is now available through:
```python
with pd.HDFStore(STORE) as store:
    df = store["df"]    # other keys are available as store[key]
```
Some comments:
- The 5000 can be tuned; try smaller/larger numbers to suit your needs.
- List append is O(1), DataFrame append is O(len(df)); see the sketch after this list.
- Until you are doing statistics or data munging you do not need pandas, so use whatever is fastest.
- This code works with several keys (data points) coming in.
- This is very little code, and we stay with a vanilla Python list and then a pandas DataFrame...
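As a quick illustration of the complexity point above (a hypothetical micro-benchmark, not from the original answer; absolute numbers will vary by machine and pandas version), appending to a list stays cheap while repeatedly growing a DataFrame does not:

```python
import timeit
import pandas as pd

row = {'high': 4.0, 'low': 3.0, 'open': 2.0, 'close': 1.0}

def append_to_list(n):
    lst = []
    for _ in range(n):
        lst.append(row)                # O(1) per append
    return pd.DataFrame(lst)           # one conversion at the end

def append_to_frame(n):
    df = pd.DataFrame([row])
    for _ in range(n):
        # each concat copies the whole frame, i.e. O(len(df)) per append
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    return df

print(timeit.timeit(lambda: append_to_list(1000), number=5))
print(timeit.timeit(lambda: append_to_frame(1000), number=5))
```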
Additionally, to read up-to-date data you can define a get method that stores and clears before reading. This way you get the most recent data:
```python
def get_latest(key, _cache=CACHE):
    store_and_clear(_cache[key], key)
    with pd.HDFStore(STORE) as store:
        return store[key]
```
Now when you access it with:

```python
df = get_latest("df")
```

you will get the most up-to-date df available.
Another option is slightly more involved: define a custom table in vanilla pytables; see the tutorial.
Note: you need to know the field names in order to create the column descriptor; see the sketch below.
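A minimal sketch of that pytables route, assuming the quote fields used above (the Quote class, table name and file name are made up for illustration):

```python
import tables

class Quote(tables.IsDescription):
    time  = tables.StringCol(19)     # e.g. '2013-01-01 00:00:00'
    stock = tables.StringCol(8)
    high  = tables.Float64Col()
    low   = tables.Float64Col()
    open  = tables.Float64Col()
    close = tables.Float64Col()

with tables.open_file('store_raw.h5', mode='w') as h5:
    table = h5.create_table('/', 'quotes', Quote, "stock quotes")
    row = table.row
    row['time']  = '2013-01-01 00:00:00'
    row['stock'] = 'BLAH'
    row['high'], row['low']   = 4.0, 3.0
    row['open'], row['close'] = 2.0, 1.0
    row.append()                     # buffered; flush writes it to disk
    table.flush()
```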