I have several TB of data (in subsets) in flat files that I want to convert to HDF5 using Python pandas / PyTables / h5py for faster querying and searching. I plan to convert each subset using something like to_hdf and save it in an HDFStore.
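Roughly this kind of conversion (a minimal sketch; the file names, node key, and chunk size are placeholders, not my actual layout, and I'm assuming CSV input here):

```python
import pandas as pd

store = pd.HDFStore('data.h5')
# Read one flat-file subset in chunks and append each chunk to one node;
# format='table' (implied by append) keeps the node appendable/queryable.
for chunk in pd.read_csv('subset_001.csv', chunksize=1_000_000):
    store.append('subset_001', chunk, data_columns=True, index=False)
store.close()
```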
Although the stored data never needs to be modified, I may need to append data to a particular subset later and then reindex that whole piece (for querying).
My question is this: is it more efficient to append the data to an existing table (using store.append) and then reindex it, or to just create a new table holding the data I need to add?
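To make the two alternatives concrete, here's a sketch of both (placeholder frames and node names; I'm assuming the schemas match when appending):

```python
import pandas as pd

# Placeholder frames standing in for an existing subset and a later addition
base = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
extra = pd.DataFrame({'a': [5, 6], 'b': [7.0, 8.0]})

store = pd.HDFStore('data.h5')
store.put('subset_001', base, format='table', index=False)

# Option 1: append to the existing node, then rebuild its index once
store.append('subset_001', extra, index=False)
store.create_table_index('subset_001', optlevel=9, kind='full')

# Option 2: leave the original node alone and write the addition as a
# separate node (this is what multiplies the node count)
store.put('subset_001_extra', extra, format='table')
store.close()
```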
If I do the latter, I could end up creating a LOT of nodes (over 100k) in the HDFStore. Would that make node access times worse?
I looked at other answers, and also created my own store with a bunch of nodes to see if there was an effect, but I couldn't find anything significant. Any help is appreciated!
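This is roughly the kind of test I ran (a sketch, not my actual benchmark; the node count and the key being timed are illustrative):

```python
import time
import pandas as pd

df = pd.DataFrame({'x': range(10)})
with pd.HDFStore('many_nodes.h5') as store:
    for i in range(10_000):             # scale toward 100k to match my case
        store.put(f'node_{i:06d}', df)  # many small independent nodes
    t0 = time.perf_counter()
    _ = store['node_009999']            # access a single node by key
    print(f'single-node lookup took {time.perf_counter() - t0:.4f}s')
```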