Python: save pandas data frame to parquet file

Is it possible to save a pandas data frame directly to a Parquet file? If not, what is the suggested process?

The goal is to send the Parquet file to another team, who will use Scala code to read/open it. Thanks!


Pandas has a built-in method, to_parquet(). Just write the data frame in Parquet format as follows:

df.to_parquet('myfile.parquet')

This uses either the pyarrow or the fastparquet backend (if no engine is specified, pandas tries pyarrow first and falls back to fastparquet). To pick one explicitly, pass the engine parameter:

df.to_parquet('myfile.parquet', engine='fastparquet')

fastparquet is a Python implementation of the Parquet format:

https://github.com/dask/fastparquet

conda install -c conda-forge fastparquet

pip install fastparquet

from fastparquet import write 
write('outfile.parq', df)

You can also control row-group sizes, compression, and the on-disk file layout when writing:

write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000], compression='GZIP', file_scheme='hive')

pyarrow also supports converting pandas data frames to its Table type:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
table = pa.Table.from_pandas(df)

Yes, pandas can write a data frame straight to Parquet. First create a small data frame:

import pandas as pd 

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

(Note: before running this, install fastparquet, e.g.: $ conda install fastparquet)

import fastparquet

Convert the data frame to Parquet and save it to the current directory:

df.to_parquet('df.parquet.gzip', compression='gzip')

Read the Parquet file in the current directory back into a pandas data frame:

pd.read_parquet('df.parquet.gzip')

Output:

   col1  col2
0     1     3
1     2     4

Yes, it is possible. Here is sample code:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
table = pa.Table.from_pandas(df, preserve_index=True)
pq.write_table(table, 'output.parquet')

Source: https://habr.com/ru/post/1663357/

