I have a large dataset (~600 GB) stored in HDF5 format. Since it is too large to fit in memory, I would like to convert it to Parquet and use pySpark for preliminary preprocessing (normalization, computing correlation matrices, etc.). However, I am not sure how to convert the whole dataset to Parquet without loading it into memory.
I looked at this example: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f#file-hdf_to_parquet-py , but it appears to read the entire dataset into memory.
One idea I had was to read the HDF5 file in chunks and write them incrementally to a Parquet file:
```python
import pandas as pd

test_store = pd.HDFStore('/path/to/myHDFfile.h5')
nrows = test_store.get_storer('df').nrows
chunksize = N
for i in range(nrows // chunksize + 1):
    chunk = test_store.select('df', start=i * chunksize, stop=(i + 1) * chunksize)
    # ... append this chunk to the Parquet file -- this is the part I don't know how to do
```
But I cannot find documentation on how to incrementally build up a Parquet file this way. Any links for further reading would be appreciated.
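For what it's worth, here is a sketch of what I am picturing, using pyarrow's `ParquetWriter` to append one row group per chunk. I am not sure this is the right approach; the file paths, the HDF5 key `'df'`, and the chunk size are placeholders:

```python
# Sketch: stream an HDF5 table into a Parquet file chunk by chunk.
# Assumes the HDF5 store contains a table under the key 'df'; paths and
# chunk size are placeholders.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

store = pd.HDFStore('/path/to/myHDFfile.h5')
nrows = store.get_storer('df').nrows
chunksize = 500_000  # placeholder chunk size

writer = None
for i in range(nrows // chunksize + 1):
    # read only one slice of the table into memory
    chunk = store.select('df', start=i * chunksize, stop=(i + 1) * chunksize)
    table = pa.Table.from_pandas(chunk)
    if writer is None:
        # open the Parquet file once, reusing the schema of the first chunk
        writer = pq.ParquetWriter('/path/to/output.parquet', table.schema)
    writer.write_table(table)  # each call appends a new row group

if writer is not None:
    writer.close()
store.close()
```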