I'm importing a large volume of HTTP logs (80GB+) into an HDFStore with pandas for statistical processing. Even within a single import file I need to batch the contents as I load them. My tactic so far has been to read the parsed lines into a DataFrame and then append that DataFrame to the HDFStore.

My goal is for the index to be unique within a single key in the HDFStore, but each DataFrame restarts its own index values from 0 again. I was expecting HDFStore.append() to have some mechanism telling it to ignore the DataFrame's index values and simply keep extending the index values already in the store, but I can't find one. How do I import DataFrames while ignoring the index values they contain, and have the HDFStore continue from its existing index values? Sample code is below. In this example I import batches of 10 lines; naturally, the real thing will be larger.
if hd_file_name:
    """ HDF5 output file specified. """
    hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
    print hdf_output

    columns = ['source', 'ip', 'unknown', 'user', 'timestamp',
               'http_verb', 'path', 'protocol', 'http_result',
               'response_size', 'referrer', 'user_agent', 'response_time']

    source_name = str(log_file.name.rsplit('/')[-1])
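One workable approach, sketched below: before appending each chunk, ask the store how many rows the target table already holds (via get_storer(key).nrows) and overwrite the chunk's index with a range continuing from that count. The helper name append_with_running_index and the tiny two-chunk DataFrames are hypothetical stand-ins for the parsed log batches; only get_storer, nrows, and append are real pandas API.

```python
import os
import tempfile

import pandas as pd


def append_with_running_index(store, key, df):
    """Append df to store[key], replacing df's index with a running
    row count so index values stay unique across chunks.
    (Hypothetical helper, not part of the pandas API.)"""
    try:
        nrows = store.get_storer(key).nrows  # rows already in the table
    except (AttributeError, KeyError):
        nrows = 0  # table does not exist yet
    df = df.copy()
    df.index = pd.RangeIndex(start=nrows, stop=nrows + len(df))
    store.append(key, df)


# Two small chunks stand in for successive batches of parsed log lines;
# each starts with its own default 0-based index.
path = os.path.join(tempfile.mkdtemp(), 'logs.h5')
with pd.HDFStore(path, complib='blosc') as store:
    append_with_running_index(store, 'log',
                              pd.DataFrame({'response_size': [10, 20]}))
    append_with_running_index(store, 'log',
                              pd.DataFrame({'response_size': [30]}))
    combined = store['log']

print(combined.index.tolist())
```

With this scheme each chunk's throwaway index is discarded at append time, so the stored table ends up with one contiguous 0..N-1 index regardless of how the batches were numbered on the way in.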
python pandas indexing dataframe hdfstore
Ben Scherrey Jun 08 '13 at 7:09