I would use HDF5 / pytables as follows:
- Keep the data as a plain Python list "as long as possible".
- Append your results to this list.
- When it gets "big":
  - push it to the HDF5 store using pandas io (and an appendable table),
  - clear the list.
- Repeat.
In fact, the function defined below keeps a separate list for each "key", so you can store multiple DataFrames in the HDF5 store in the same process.
We define a function that you call with each row d:
```python
import pandas as pd

CACHE = {}
STORE = 'store.h5'   # Note: another option is to keep the actual file open

def process_row(d, key, max_len=5000, _cache=CACHE):
    """
    Append row d to the store 'key'.

    When the number of items in the key's cache reaches max_len,
    append the list of rows to the HDF5 store and clear the list.
    """
    # keep the rows for each key separate
    lst = _cache.setdefault(key, [])
    if len(lst) >= max_len:
        store_and_clear(lst, key)
    lst.append(d)

def store_and_clear(lst, key):
    """
    Convert the key's cache list to a DataFrame and append it to HDF5.
    """
    df = pd.DataFrame(lst)
    with pd.HDFStore(STORE) as store:
        store.append(key, df)
    lst.clear()
```
Note: we use the with statement to automatically close the store after each write. It may be faster to keep it open, but if you do, it is recommended that you flush it regularly (closing flushes); see the sketch below. Also note that it might be more readable to use a collections.deque rather than a list, but list performance will be slightly better here.
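As a rough sketch of that keep-it-open alternative (not part of the main code above; the name store_and_clear_open is purely illustrative), you could open the store once and flush explicitly after each batch:

```python
import pandas as pd

STORE = 'store.h5'
store = pd.HDFStore(STORE)      # opened once and kept open for the whole run

def store_and_clear_open(lst, key):
    """Like store_and_clear, but reuses the already-open store."""
    store.append(key, pd.DataFrame(lst))
    store.flush(fsync=True)     # flush regularly, since we never close between batches
    lst.clear()

# ... process rows as before, then when the job is done:
# store.close()
```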
To use this, you call:
```python
process_row({'time': '2013-01-01 00:00:00', 'stock': 'BLAH',
             'high': 4.0, 'low': 3.0, 'open': 2.0, 'close': 1.0},
            key="df")
```
Note: "df" is the saved key used in the pytables repository.
After completing the job, make sure you store_and_clear rest of the cache:
```python
for k, lst in CACHE.items():
    store_and_clear(lst, k)
```
Your full DataFrame is now available through:
```python
with pd.HDFStore(STORE) as store:
    df = store["df"]    # other keys are available as store[key]
```
Some comments:
- The 5000 can be tuned; try smaller/larger numbers to suit your needs.
- List append is O(1), DataFrame append is O(len(df)); see the sketch after this list.
- Until you are doing statistics or data munging you do not need pandas, so use whatever is fastest.
- This code works with several keys (data points) coming in.
- This is very little code, and we stay with a vanilla Python list and then a pandas DataFrame...
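As a quick illustration of the complexity point above (a hypothetical micro-benchmark, not from the original answer; absolute numbers will vary by machine and pandas version), appending to a list stays cheap while repeatedly growing a DataFrame does not:

```python
import timeit
import pandas as pd

row = {'high': 4.0, 'low': 3.0, 'open': 2.0, 'close': 1.0}

def append_to_list(n):
    lst = []
    for _ in range(n):
        lst.append(row)                # O(1) per append
    return pd.DataFrame(lst)           # one conversion at the end

def append_to_frame(n):
    df = pd.DataFrame([row])
    for _ in range(n):
        # each concat copies the whole frame, i.e. O(len(df)) per append
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    return df

print(timeit.timeit(lambda: append_to_list(1000), number=5))
print(timeit.timeit(lambda: append_to_frame(1000), number=5))
```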
Additionally, to read up-to-date data you can define a get method that stores and clears before reading. This way you get the most recent data:
```python
def get_latest(key, _cache=CACHE):
    store_and_clear(_cache[key], key)
    with pd.HDFStore(STORE) as store:
        return store[key]
```
Now when you access it with:

```python
df = get_latest("df")
```

you will get the most up-to-date df available.
Another option is slightly more involved: define a custom table in vanilla pytables; see the tutorial.
Note: you need to know the field names in order to create the column descriptor; see the sketch below.
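A minimal sketch of that pytables route, assuming the quote fields used above (the Quote class, table name and file name are made up for illustration):

```python
import tables

class Quote(tables.IsDescription):
    time  = tables.StringCol(19)     # e.g. '2013-01-01 00:00:00'
    stock = tables.StringCol(8)
    high  = tables.Float64Col()
    low   = tables.Float64Col()
    open  = tables.Float64Col()
    close = tables.Float64Col()

with tables.open_file('store_raw.h5', mode='w') as h5:
    table = h5.create_table('/', 'quotes', Quote, "stock quotes")
    row = table.row
    row['time']  = '2013-01-01 00:00:00'
    row['stock'] = 'BLAH'
    row['high'], row['low']   = 4.0, 3.0
    row['open'], row['close'] = 2.0, 1.0
    row.append()                     # buffered; flush writes it to disk
    table.flush()
```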