Adding a column to an HDF file frame in Pandas

I work with a large dataset in CSV format. I am trying to process the data column by column and append each column to the frame in the HDF file, all using pandas. My motivation is that, although the entire dataset is much larger than my physical memory, a single column is manageable. At a later stage I will perform functional logistic regression, loading the columns back into memory one by one and working on them.

I can create a new HDF file and a new frame containing the first column:

    import pandas

    hdf_file = pandas.HDFStore('train_data.hdf')
    feature_column = pandas.read_csv('data.csv', usecols=[0])
    hdf_file.append('features', feature_column)

But after that, I get a ValueError when I try to add a new column to the frame:

    feature_column = pandas.read_csv('data.csv', usecols=[1])
    hdf_file.append('features', feature_column)

Stack trace and error message:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 658, in append
        self._write_to_group(key, value, table=True, append=True, **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 923, in _write_to_group
        s.write(obj = value, append=append, complib=complib, **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2985, in write
        **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2675, in create_axes
        raise ValueError("cannot match existing table structure for [%s] on appending data" % items)
    ValueError: cannot match existing table structure for [srch_id] on appending data

I am new to working with large datasets and limited memory, so I am open to suggestions on alternative ways of working with this data.

2 answers

Full docs are here, and some cookbook strategies are here.

PyTables (which HDFStore uses under the hood) is row-oriented, so you can only append rows, not columns. Read the CSV chunk by chunk, then append each whole chunk as you go, something like this:

    import pandas as pd

    store = pd.HDFStore('file.h5', mode='w')
    for chunk in pd.read_csv('file.csv', chunksize=50000):
        store.append('df', chunk)
    store.close()

You have to be a little careful, because chunks read separately can end up with different dtypes. For example, suppose you have an integer column with no missing values until the 2nd chunk: the first chunk will infer that column as int64, the second as float64, and the append will fail with an error like the one above. You may need to force the dtypes with the dtype keyword argument of read_csv; see here.
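A minimal sketch of forcing the dtypes while chunking; the file names match the question, srch_id is taken from the traceback, and price is a made-up example column:

    import pandas as pd

    # Pin each column to a fixed dtype so every chunk matches the table
    # already in the store. If a column can contain missing values, force
    # float64 instead of int64, since int64 cannot hold NaN.
    dtypes = {'srch_id': 'int64', 'price': 'float64'}

    store = pd.HDFStore('train_data.hdf', mode='w')
    for chunk in pd.read_csv('data.csv', chunksize=50000, dtype=dtypes):
        store.append('features', chunk)
    store.close()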

This is the same question.
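Since the original goal is to load columns back into memory one at a time, note that a table stored this way can be read back by column with select. A minimal sketch, assuming the key and column name used above:

    import pandas as pd

    store = pd.HDFStore('train_data.hdf')
    # Read back a single column of the stored table without loading the
    # rest; 'srch_id' is from the traceback, other names would come from
    # the CSV header.
    col = store.select('features', columns=['srch_id'])
    store.close()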


In 2019, you could use Dask to lazily load your data.
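A minimal sketch with Dask, assuming the same data.csv and the srch_id column from the traceback:

    import dask.dataframe as dd

    # Nothing is read into memory yet; Dask just builds a lazy task graph
    # over the CSV.
    df = dd.read_csv('data.csv')

    # Materialize a single column only when it is actually needed.
    srch_id = df['srch_id'].compute()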

