Convert HDF5 to Parquet without loading it into memory

I have a large dataset (~600 GB) stored in HDF5 format. Since it is too large to fit in memory, I would like to convert it to the Parquet format and use pySpark for preliminary preprocessing (normalization, computing correlation matrices, etc.). However, I'm not sure how to convert the entire dataset to Parquet without loading it into memory.

I looked at this gist: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f#file-hdf_to_parquet-py , but it appears to read the entire dataset into memory.

One idea was to read the HDF5 file in chunks and append each chunk to the Parquet file incrementally:

    import pandas as pd

    test_store = pd.HDFStore('/path/to/myHDFfile.h5')
    nrows = test_store.get_storer('df').nrows
    chunksize = N

    for i in range(nrows // chunksize + 1):
        # convert_to_Parquet() ...
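Roughly, the reading side I have in mind looks like this (a sketch only, assuming the data was written in table format under the key 'df'; the path and chunk size are placeholders):

    import pandas as pd

    h5_path = '/path/to/myHDFfile.h5'
    key = 'df'
    chunksize = 500000

    store = pd.HDFStore(h5_path, mode='r')
    nrows = store.get_storer(key).nrows

    for start in range(0, nrows, chunksize):
        # select() reads only the requested row range from disk,
        # so memory use stays bounded by the chunk size
        chunk = store.select(key, start=start, stop=start + chunksize)
        # ... append this chunk to the Parquet file -- this is the part I'm missing

    store.close()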

But I cannot find documentation on how to incrementally build up a Parquet file. Any links for further reading would be appreciated.

1 answer

You can use pyarrow for this!

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):
        stream = pd.read_hdf(h5_file, chunksize=chunksize)

        for i, chunk in enumerate(stream):
            print("Chunk {}".format(i))

            if i == 0:
                # Infer schema and open the Parquet writer on the first chunk
                parquet_schema = pa.Table.from_pandas(df=chunk).schema
                parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

            table = pa.Table.from_pandas(chunk, schema=parquet_schema)
            parquet_writer.write_table(table)

        parquet_writer.close()
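Note that pd.read_hdf only accepts chunksize if the dataset was saved in table format (format='table'); a fixed-format store would have to be rewritten first. A rough usage sketch with placeholder paths, ending with the pySpark read you mentioned:

    convert_hdf5_to_parquet('/path/to/myHDFfile.h5', '/path/to/output.parquet')

    # Load the result lazily in pySpark for the downstream preprocessing
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("preprocessing").getOrCreate()
    df = spark.read.parquet('/path/to/output.parquet')
    df.printSchema()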

Source: https://habr.com/ru/post/1271692/

