Python: save pandas data frame to parquet file

Is it possible to save a pandas data frame directly to a Parquet file? If not, what is the suggested process?

The goal is to send the Parquet file to another team, who will use Scala code to read/open it. Thanks!


Pandas has a built-in method, to_parquet(). Just write the data frame in Parquet format as follows:

df.to_parquet('myfile.parquet')

This uses either the pyarrow or the fastparquet backend (if no engine is specified, pandas tries pyarrow first and falls back to fastparquet). To pick one explicitly, pass the engine parameter:

df.to_parquet('myfile.parquet', engine='fastparquet')

fastparquet is a Python implementation of the Parquet format:

https://github.com/dask/fastparquet

conda install -c conda-forge fastparquet

pip install fastparquet

from fastparquet import write 
write('outfile.parq', df)

You can also control row-group sizes, compression, and the on-disk file layout when writing:

write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000], compression='GZIP', file_scheme='hive')

pyarrow also supports converting pandas data frames to its Table type:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
table = pa.Table.from_pandas(df)

Yes, pandas can write a data frame straight to Parquet. First create a small data frame:

import pandas as pd 

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

(Note: before running this, install fastparquet, e.g.: $ conda install fastparquet)

import fastparquet

Convert the data frame to Parquet and save it to the current directory:

df.to_parquet('df.parquet.gzip', compression='gzip')

Read the Parquet file in the current directory back into a pandas data frame:

pd.read_parquet('df.parquet.gzip')

Output:

   col1  col2
0     1     3
1     2     4

Yes, it is possible. Here is sample code:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
table = pa.Table.from_pandas(df, preserve_index=True)
pq.write_table(table, 'output.parquet')

Source: https://habr.com/ru/post/1663357/

