Convert HDF5 to Parquet without loading it into memory

I have a large dataset (~600 GB) stored in HDF5 format. Since it is too large to fit in memory, I would like to convert it to the Parquet format and use pySpark for preliminary preprocessing (normalization, computing correlation matrices, etc.). However, I'm not sure how to convert the entire dataset to Parquet without loading it into memory.

I looked at this gist: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f#file-hdf_to_parquet-py , but it appears to read the entire dataset into memory.

One idea was to read the HDF5 file in chunks and append each chunk to the Parquet file incrementally:

    import pandas as pd

    test_store = pd.HDFStore('/path/to/myHDFfile.h5')
    nrows = test_store.get_storer('df').nrows
    chunksize = N

    for i in range(nrows // chunksize + 1):
        # convert_to_Parquet() ...
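Roughly, the reading side I have in mind looks like this (a sketch only, assuming the data was written in table format under the key 'df'; the path and chunk size are placeholders):

    import pandas as pd

    h5_path = '/path/to/myHDFfile.h5'
    key = 'df'
    chunksize = 500000

    store = pd.HDFStore(h5_path, mode='r')
    nrows = store.get_storer(key).nrows

    for start in range(0, nrows, chunksize):
        # select() reads only the requested row range from disk,
        # so memory use stays bounded by the chunk size
        chunk = store.select(key, start=start, stop=start + chunksize)
        # ... append this chunk to the Parquet file -- this is the part I'm missing

    store.close()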

But I cannot find documentation on how to incrementally build up a Parquet file. Any links for further reading would be appreciated.

1 answer

You can use pyarrow for this!

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):
        stream = pd.read_hdf(h5_file, chunksize=chunksize)

        for i, chunk in enumerate(stream):
            print("Chunk {}".format(i))

            if i == 0:
                # Infer schema and open the Parquet writer on the first chunk
                parquet_schema = pa.Table.from_pandas(df=chunk).schema
                parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

            table = pa.Table.from_pandas(chunk, schema=parquet_schema)
            parquet_writer.write_table(table)

        parquet_writer.close()
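Note that pd.read_hdf only accepts chunksize if the dataset was saved in table format (format='table'); a fixed-format store would have to be rewritten first. A rough usage sketch with placeholder paths, ending with the pySpark read you mentioned:

    convert_hdf5_to_parquet('/path/to/myHDFfile.h5', '/path/to/output.parquet')

    # Load the result lazily in pySpark for the downstream preprocessing
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("preprocessing").getOrCreate()
    df = spark.read.parquet('/path/to/output.parquet')
    df.printSchema()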

Source: https://habr.com/ru/post/1271692/

