I am working with a large dataset in CSV format. I am trying to process the data column by column and then append each column to a frame in an HDF file. All of this is done with pandas. My motivation is that, although the entire dataset is much larger than my physical memory, any single column fits comfortably. At a later stage I will perform logistic regression, loading the columns back into memory one at a time and operating on them.
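To make the intent concrete, the loop I am ultimately aiming for looks roughly like the sketch below; the column count of 100 is just illustrative, since I have not hard-coded the real width yet:

    import pandas

    hdf_file = pandas.HDFStore('train_data.hdf')

    # Read one column at a time (100 is an illustrative column count)
    # and grow a single on-disk 'features' frame, column by column.
    for i in range(100):
        feature_column = pandas.read_csv('data.csv', usecols=[i])
        hdf_file.append('features', feature_column)  # fails once i >= 1

    hdf_file.close()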
I can create a new HDF file and start the frame with the first column:
    import pandas

    hdf_file = pandas.HDFStore('train_data.hdf')
    feature_column = pandas.read_csv('data.csv', usecols=[0])
    hdf_file.append('features', feature_column)
But after that, I get a ValueError when I try to add a new column to the frame:
    feature_column = pandas.read_csv('data.csv', usecols=[1])
    hdf_file.append('features', feature_column)
Stack trace and error message:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 658, in append
        self._write_to_group(key, value, table=True, append=True, **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 923, in _write_to_group
        s.write(obj = value, append=append, complib=complib, **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2985, in write
        **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2675, in create_axes
        raise ValueError("cannot match existing table structure for [%s] on appending data" % items)
    ValueError: cannot match existing table structure for [srch_id] on appending data
I am new to working with datasets that exceed physical memory, so I am open to suggestions on alternative ways of working with this data.