HDFStore: efficiency between adding data to an existing table and reindexing versus creating a new table

Question


I have several TB of data (in subsets) in flat files that I want to convert to HDF5 using Python pandas / PyTables / h5py for faster querying and searching. I plan to convert each subsection of the data using something like to_hdf and save them in an HDFStore.

Although the stored data never needs to be modified, I may later need to append data to a specific subsection and then reindex that entire subsection (for queries).

My question is this: is it more efficient to append data to an existing table (using store.append) and then reindex that table, or to just create a new table with the data I need to add?
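A minimal sketch of the two options, to make the comparison concrete. The node names (`subset_a`, `subset_b/part_N`), the helper `make_chunk`, and the column `row_id` are hypothetical and only stand in for the real data; the sketch assumes pandas with PyTables installed and says nothing about which option is faster at TB scale.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def make_chunk(start, n=1000):
    # Hypothetical stand-in for one chunk read from a flat file.
    return pd.DataFrame(
        {"row_id": np.arange(start, start + n), "value": np.random.rand(n)}
    )


path = os.path.join(tempfile.mkdtemp(), "demo.h5")
query = "row_id >= 990 & row_id < 1010"

with pd.HDFStore(path, mode="w") as store:
    # Option A: one table per subsection. Append without indexing,
    # then build the index once after the data has been added.
    store.append("subset_a", make_chunk(0), data_columns=["row_id"], index=False)
    store.append("subset_a", make_chunk(1000), data_columns=["row_id"], index=False)
    store.create_table_index("subset_a", columns=["row_id"], optlevel=9, kind="full")

    # Option B: every later addition becomes its own node (indexed on write).
    store.append("subset_b/part_0", make_chunk(0), data_columns=["row_id"])
    store.append("subset_b/part_1", make_chunk(1000), data_columns=["row_id"])

    # Option A answers the query from a single table ...
    rows_a = store.select("subset_a", where=query)
    # ... while option B has to query each node and combine the results.
    parts = [
        store.select(key, where=query)
        for key in ("subset_b/part_0", "subset_b/part_1")
    ]
    rows_b = pd.concat(parts)
    n_nodes = len(store.keys())
```

Both options return the same rows here; the difference lies in the reindexing cost for option A versus the number of nodes and per-node queries for option B.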

If I do the latter, I could end up creating a LOT (over 100k) of nodes in the HDFStore. Would that make node access time worse?

I tried to look at other answers, and also created my own store with a bunch of nodes to see if there was an effect, but I could not find anything significant. Any help is appreciated!


python hdf5 hdfstore

Cuber Aug 7 '17 at 13:47


1 answer