Feather and Parquet are both columnar (on-disk) storage formats for use in data analysis systems. Both are integrated into Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer.
How do the two formats differ?
Should you always prefer Feather when working with pandas, whenever that is possible?
What are the use cases where Feather is more suitable than Parquet, and vice versa?
Appendix
I found some hints here https://github.com/wesm/feather/issues/188 , but given the young age of this project, they may be a bit out of date.
This is not a serious speed test, because I'm just dumping and loading a whole DataFrame, but it should give you some impression if you have never heard of the formats before:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq
import fastparquet as fp
df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]})
print("pandas df to disk ####################################################")
print('example_feather:')
%timeit feather.write_feather(df, 'example_feather')
print('example_parquet:')
%timeit pq.write_table(pa.Table.from_pandas(df), 'example.parquet')
print()
print("for comparison:")
print('example_pickle:')
%timeit df.to_pickle('example_pickle')
print('example_fp_parquet:')
%timeit fp.write('example_fp_parquet', df)
print('example_hdf:')
%timeit df.to_hdf('example_hdf', key='key_to_store', mode='w', format='table')
print()
print("pandas df from disk ##################################################")
print('example_feather:')
%timeit feather.read_feather('example_feather')
print('example_parquet:')
%timeit pq.read_table('example.parquet').to_pandas()
print("for comparison:")
print('example_pickle:')
%timeit pd.read_pickle('example_pickle')
print('example_fp_parquet:')
%timeit fp.ParquetFile('example_fp_parquet').to_pandas()
print('example_hdf:')
%timeit pd.read_hdf('example_hdf')
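The %timeit lines above only run inside IPython or Jupyter. In a plain Python script, the same kind of rough measurement could be sketched with the standard library's timeit module. The helper name time_call is hypothetical, and plain pickle on a dict is used here only as a stand-in so the sketch runs without pandas or pyarrow installed:

```python
import os
import pickle
import tempfile
import timeit


def time_call(func, repeat=5, number=10):
    """Return the best average time per call, in seconds (hypothetical helper)."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


# Stand-in payload and format: a plain dict written with pickle.
data = {'one': [-1, None, 2.5],
        'two': ['foo', 'bar', 'baz'],
        'three': [True, False, True]}
path = os.path.join(tempfile.mkdtemp(), 'example_pickle')


def write_pickle():
    # Dump the whole payload to disk, as in the write half of the test above.
    with open(path, 'wb') as f:
        pickle.dump(data, f)


def read_pickle():
    # Load the whole payload back, as in the read half of the test above.
    with open(path, 'rb') as f:
        return pickle.load(f)


write_seconds = time_call(write_pickle)
read_seconds = time_call(read_pickle)
print(f'example_pickle write: {write_seconds * 1e6:.1f} us per call')
print(f'example_pickle read:  {read_seconds * 1e6:.1f} us per call')
```

To time the formats from the question, any zero-argument callable can be passed instead, for example time_call(lambda: feather.write_feather(df, 'example_feather')).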