What is the difference between feather and parquet?

Both are columnar (on-disk) storage formats for use in data analysis systems. Both are integrated into Apache Arrow (via the pyarrow package for Python) and are designed to match Arrow as the columnar in-memory analytics layer.

How do the two formats differ in practice?

If you do all of your work in pandas, when should you prefer one format over the other?

What are typical use cases where Feather is more suitable than Parquet, and vice versa?


Appendix

I found some hints here: https://github.com/wesm/feather/issues/188 , but given the young age of this project, they may be a bit outdated.

This is not a serious speed test, since I am just dumping and loading a whole DataFrame, but it should give you some impression if you have never heard of these formats before:

# IPython
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq
import fastparquet as fp


df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]})

print("pandas df to disk ####################################################")
print('example_feather:')
%timeit feather.write_feather(df, 'example_feather')
# 2.62 ms ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_parquet:')
%timeit pq.write_table(pa.Table.from_pandas(df), 'example.parquet')
# 3.19 ms ± 51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("for comparison:")
print('example_pickle:')
%timeit df.to_pickle('example_pickle')
# 2.75 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print('example_fp_parquet:')
%timeit fp.write('example_fp_parquet', df)
# 7.06 ms ± 205 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit df.to_hdf('example_hdf', 'key_to_store', mode='w', format='table')
# 24.6 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
print()

print("pandas df from disk ##################################################")
print('example_feather:')
%timeit feather.read_feather('example_feather')
# 969 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_parquet:')
%timeit pq.read_table('example.parquet').to_pandas()
# 1.9 ms ± 5.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

print("for comparison:")
print('example_pickle:')
%timeit pd.read_pickle('example_pickle')
# 1.07 ms ± 6.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('example_fp_parquet:')
%timeit fp.ParquetFile('example_fp_parquet').to_pandas()
# 4.53 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('example_hdf:')
%timeit pd.read_hdf('example_hdf')
# 10 ms ± 43.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# pandas version: 0.22.0
# fastparquet version: 0.1.3
# numpy version: 1.13.3
# pyarrow version: 0.8.0
# sys.version: 3.6.3
# example Dataframe taken from https://arrow.apache.org/docs/python/parquet.html
1 answer
  • The Parquet format is designed for long-term storage, whereas Feather (Arrow) is more suitable for short-term or ephemeral storage. (Arrow may become more suitable for long-term storage after the 1.0.0 release, since the binary format will be stable then.)

  • Parquet is more expensive to write than Feather, because it features more layers of encoding and compression. Feather is unmodified raw columnar Arrow memory. We will probably add simple compression to Feather in the future.

  • Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files.

  • Parquet is a standard storage format for analytics that is supported by many different systems: Spark, Hive, Impala, various AWS services, in the future BigQuery, etc. So if you are doing analytics, Parquet is a good option as a reference storage format for queries by multiple systems.

The benchmarks you show are going to be very noisy, since the data you read and write is very small. You should try compressing at least 100 MB, or better 1 GB, of data to get more informative benchmarks; see e.g. http://wesmckinney.com/blog/python-parquet-multithreading/

Hope this helps!


Source: https://habr.com/ru/post/1692793/
