How to add large amounts of data to a Pandas HDFStore and get a unique index?

I am importing a large number of http logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic so far has been to read the parsed lines into a DataFrame and then store the DataFrame into the HDFStore. My goal is to have a single, continuously increasing index for one key in the HDFStore, but each DataFrame restarts its own index values at zero. I was expecting HDFStore.append() to have some mechanism to tell it to ignore the DataFrame's index values and simply keep extending the existing index values for my HDFStore key, but I cannot find one. How do I import DataFrames while ignoring the index values contained in them, so that the HDFStore keeps incrementing its existing index? Sample code below batches every 10 lines; naturally, the real thing is larger.

if hd_file_name:
    """ HDF5 output file specified. """
    hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
    print hdf_output

    columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb',
               'path', 'protocol', 'http_result', 'response_size', 'referrer',
               'user_agent', 'response_time']

    # HDF5 Tables don't play nice with unicode so explicit str(). :(
    source_name = str(log_file.name.rsplit('/')[-1])

    batch = []
    for count, line in enumerate(log_file, 1):
        data = parse_line(line, rejected_output=reject_output)
        # Add our source file name to the beginning.
        data.insert(0, source_name)
        batch.append(data)
        if not (count % 10):
            df = pd.DataFrame(batch, columns=columns)
            hdf_output.append(KEY_NAME, df)
            batch = []

    if (count % 10):
        df = pd.DataFrame(batch, columns=columns)
        hdf_output.append(KEY_NAME, df)
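To make the symptom concrete, here is a minimal sketch of what happens when two default-indexed DataFrames are appended to the same key (the file name symptom.h5 and the key 'logs' are made up for illustration):

import pandas as pd

# Two batches, each with a default RangeIndex starting at 0.
batch1 = pd.DataFrame({'response_size': [100, 200, 300]})
batch2 = pd.DataFrame({'response_size': [400, 500, 600]})

with pd.HDFStore('symptom.h5') as store:
    store.append('logs', batch1)
    store.append('logs', batch2)
    # The stored index repeats: [0, 1, 2, 0, 1, 2] rather than 0..5.
    print(store['logs'].index.tolist())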
python pandas indexing dataframe hdfstore
Jun 08 '13 at 7:09
1 answer

You can do it like this. The only trick is that the first time through, the store table doesn't exist yet, so get_storer will raise.

import pandas as pd
import numpy as np
import os

files = ['test1.csv','test2.csv']

# Write a couple of sample input files.
for f in files:
    pd.DataFrame(np.random.randn(10,2),columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

with pd.get_store(path) as store:
    for f in files:
        df = pd.read_csv(f,index_col=0)

        # The first time through, the 'foo' table doesn't exist yet.
        try:
            nrows = store.get_storer('foo').nrows
        except:
            nrows = 0

        # Shift the incoming index past the rows already stored.
        df.index = pd.Series(df.index) + nrows
        store.append('foo',df)

In [10]: pd.read_hdf('test.h5','foo')
Out[10]: 
           A         B
0   0.772017  0.153381
1   0.304131  0.368573
2   0.995465  0.799655
3  -0.326959  0.923280
4  -0.808376  0.449645
5  -1.336166  0.236968
6  -0.593523 -0.359080
7  -0.098482  0.037183
8   0.315627 -1.027162
9  -1.084545 -1.922288
10  0.412407 -0.270916
11  1.835381 -0.737411
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
15  1.181344  0.354411
16  0.501892 -0.358361
17  0.633256  0.419397
18  0.932354 -0.603932
19 -0.341135  2.453220
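If you are on a newer pandas where pd.get_store has been removed, the same pattern should work with pd.HDFStore used directly as a context manager; a sketch under that assumption:

import os
import numpy as np
import pandas as pd

files = ['test1.csv', 'test2.csv']
for f in files:
    pd.DataFrame(np.random.randn(10, 2), columns=list('AB')).to_csv(f)

path = 'test.h5'
if os.path.exists(path):
    os.remove(path)

with pd.HDFStore(path) as store:
    for f in files:
        df = pd.read_csv(f, index_col=0)
        try:
            nrows = store.get_storer('foo').nrows
        except (KeyError, AttributeError):
            # The 'foo' table doesn't exist yet on the first pass.
            nrows = 0
        # Shift the incoming index past the rows already stored.
        df.index = pd.Series(df.index) + nrows
        store.append('foo', df)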

In fact, you don't necessarily need a globally unique index (unless you want one), since HDFStore (via PyTables) already provides one by uniquely numbering the rows. You can always add these selection parameters.

In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
Out[11]: 
           A         B
12 -0.607571  0.507790
13  0.043509 -0.294086
14 -0.465210  0.880798
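Since the combined index ends up globally unique and is stored as a queryable column in the table format, you can also select by index values rather than physical row numbers. A sketch, assuming the 'foo' table written above and the default index column name used by HDFStore:

import pandas as pd

# Select rows by the stored index values via a where= query.
subset = pd.read_hdf('test.h5', 'foo', where='index >= 12 & index < 15')
print(subset)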
Jun 08 '13 at 12:10


