Parquet vs ORC vs ORC with Snappy

I ran several tests on the storage formats available in Hive, with Parquet and ORC as the main candidates. I tested ORC once with the default compression (ZLIB) and once with Snappy.
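For reference, the tables were created with statements roughly like the following; the column names and schema here are simplified placeholders rather than my real data, but the STORED AS and TBLPROPERTIES clauses show the formats I compared:

    -- Source data loaded as plain text
    CREATE TABLE table_a_text (id BIGINT, amount DOUBLE, category STRING, created_at STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Table B: ORC with the default codec (ZLIB)
    CREATE TABLE table_b_orc STORED AS ORC
    AS SELECT * FROM table_a_text;

    -- Table C: ORC with Snappy
    CREATE TABLE table_c_orc_snappy STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "SNAPPY")
    AS SELECT * FROM table_a_text;

    -- Table D: Parquet
    CREATE TABLE table_d_parquet STORED AS PARQUET
    AS SELECT * FROM table_a_text;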

I have read many documents that say Parquet is better than ORC in both time and space, but my tests show the opposite of the documents I went through.

Here are some details about my data.

  • Table A - Text file format - 2.5 GB
  • Table B - ORC - 652 MB
  • Table C - ORC with Snappy - 802 MB
  • Table D - Parquet - 1.9 GB

Parquet gave the worst compression for my table.

My tests with the above tables gave the following results.

Row count operation

  • Text format: Cumulative CPU - 123.33 sec
  • Parquet format: Cumulative CPU - 204.92 sec
  • ORC format: Cumulative CPU - 119.99 sec
  • ORC with Snappy: Cumulative CPU - 107.05 sec

Sum of a column

  • Text format: Cumulative CPU - 127.85 sec
  • Parquet format: Cumulative CPU - 255.2 sec
  • ORC format: Cumulative CPU - 120.48 sec
  • ORC with Snappy: Cumulative CPU - 98.27 sec

Average of a column

  • Text format: Cumulative CPU - 128.79 sec
  • Parquet format: Cumulative CPU - 211.73 sec
  • ORC format: Cumulative CPU - 165.5 sec
  • ORC with Snappy: Cumulative CPU - 135.45 sec

Select four columns in a given range using a WHERE clause

  • Text format: Cumulative CPU - 72.48 sec
  • Parquet format: Cumulative CPU - 136.4 sec
  • ORC format: Cumulative CPU - 96.63 sec
  • ORC with Snappy: Cumulative CPU - 82.05 sec
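In case it helps, the queries behind these four operations were essentially of the following shape, run once against each table (column names are again simplified placeholders):

    -- Row count
    SELECT COUNT(*) FROM table_b_orc;

    -- Sum of a column
    SELECT SUM(amount) FROM table_b_orc;

    -- Average of a column
    SELECT AVG(amount) FROM table_b_orc;

    -- Four columns restricted to a range with a WHERE clause
    SELECT id, amount, category, created_at
    FROM table_b_orc
    WHERE id BETWEEN 1000000 AND 2000000;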

Does this mean that ORC is faster than Parquet? Or is there something I can do to improve the response time and compression ratio with Parquet?

Thanks!

+68
hadoop hive parquet snappy orc
Sep 03 '15 at 10:45
6 answers

I would say that both of these formats have their advantages.

Parquet may be better if you have heavily nested data, because it stores its elements as a tree, the way Google Dremel does (see here).
Apache ORC might be better if your data structure is flat.

And, as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, additional Bloom filters, which can help improve query response time, especially for sum operations.
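If you want to try the Bloom filters, they are declared per column through table properties, roughly like this (the table and column names are just examples):

    CREATE TABLE orders_orc (
      order_id    BIGINT,
      customer_id BIGINT,
      amount      DOUBLE
    )
    STORED AS ORC
    TBLPROPERTIES (
      "orc.compress"             = "SNAPPY",
      "orc.bloom.filter.columns" = "customer_id",
      "orc.bloom.filter.fpp"     = "0.05"
    );

    -- Make sure predicate push-down is enabled so the row-group index
    -- and Bloom filters are actually consulted at read time
    SET hive.optimize.index.filter = true;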

The default compression for Parquet is SNAPPY. Do tables A, B, C and D all hold the same dataset? If so, something looks off when it only compresses down to 1.9 GB.
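Regarding the 1.9 GB: it is worth double-checking which codec your Parquet table actually ended up with, for example by requesting Snappy explicitly when rewriting it. As far as I know, both the session setting and the table property below are honoured by Hive's Parquet writer; the table names are placeholders:

    SET parquet.compression = SNAPPY;

    CREATE TABLE my_table_parquet_snappy STORED AS PARQUET
    TBLPROPERTIES ("parquet.compression" = "SNAPPY")
    AS SELECT * FROM my_table_text;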

+36
Sep 07 '15 at 22:02

You see this because:

  • Hive has a vectorized ORC reader, but no vectorized Parquet reader.

  • Spark has a vectorized Parquet reader, but no vectorized ORC reader.

  • Spark performs best with Parquet; Hive performs best with ORC.

I saw similar differences when working with ORC and Parquet with Spark.

Vectorization means that rows are decoded in batches, which dramatically improves memory locality and cache utilization.

(fixed with Hive 2.0 and Spark 2.1)
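If you want to verify this on your own setup, these are, as far as I know, the switches that control the vectorized readers (the Spark ORC settings only exist in newer releases with the native ORC reader):

    -- Hive: vectorized execution (used by the ORC reader)
    SET hive.vectorized.execution.enabled = true;
    SET hive.vectorized.execution.reduce.enabled = true;

    -- Spark SQL (e.g. in spark-sql or via spark.sql("SET ...")):
    SET spark.sql.parquet.enableVectorizedReader = true;
    -- Only in newer Spark versions:
    SET spark.sql.orc.impl = native;
    SET spark.sql.orc.enableVectorizedReader = true;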

+31
May 05 '17 at 1:06 pm

We conducted several tests comparing different file formats (Avro, JSON, ORC and Parquet) in different use cases.

https://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

All of the data is public and the benchmark code is open source at:

https://github.com/apache/orc/tree/branch-1.4/java/bench

+5
Dec 03 '17 at 17:31

The two biggest considerations for ORC over Parquet in Hive are:

Many of the performance improvements provided by the Stinger initiative depend on features of the ORC format, including block-level indexes for each column. This leads to potentially more efficient I/O, allowing Hive to skip reading entire blocks of data if the predicate values indicate they are not needed. The cost-based optimizer can also use the column-level metadata present in ORC files to generate the most efficient query plan.

ACID operations are only possible when using ORC as the file format.
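For completeness, an ACID table has to be bucketed, stored as ORC and marked transactional, roughly like this (names are illustrative, and the metastore also needs the compactor-related settings not shown here):

    -- Session/server-side transaction settings
    SET hive.support.concurrency = true;
    SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

    CREATE TABLE events_acid (id BIGINT, payload STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ("transactional" = "true");

    -- UPDATE/DELETE only work on such transactional ORC tables
    UPDATE events_acid SET payload = 'updated' WHERE id = 42;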

A few considerations for Parquet over ORC in Spark: 1) Simple creation of DataFrames in Spark, with no need to specify schemas (see the sketch below). 2) Works well with highly nested data.

Spark and Parquet are a good combination.
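For example, Spark SQL can query a Parquet directory directly, picking the schema up from the files themselves (the path is a placeholder):

    -- No table definition or schema needed
    SELECT COUNT(*) FROM parquet.`/data/events_parquet`;

    -- Or register it as a table backed by the Parquet files
    CREATE TABLE events USING parquet OPTIONS (path '/data/events_parquet');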

+2
Jun 02 '17 at 7:44

Both of them have their advantages. We use Parquet together with Hive and Impala, but we just wanted to point out a few advantages of ORC over Parquet: during long-running queries, when Hive queries ORC tables, GC is called about 10 times less often. That may be negligible for many projects, but it can be crucial for others.

ORC also takes much less time when you need to select just a few columns from a table. Some other queries, especially ones with joins, also take less time thanks to vectorized query execution, which is not available for Parquet.

In addition, ORC compression is sometimes a bit unpredictable, while Parquet compression is much more consistent. It seems that when an ORC table has many numeric columns, it does not compress as well. This affects both zlib and snappy compression.

+2
Jan 09 '18 at 18:44

Both Parquet and ORC have their advantages and disadvantages. I just try to follow a simple rule of thumb: "How nested is your data and how many columns are there?" If you follow Google Dremel, you can see how Parquet is designed: it stores data in a hierarchical tree structure, with more deeply nested fields sitting deeper in the tree.

ORC, on the other hand, is meant for flattened data storage. So if your data is flatter, with fewer columns, you can go with ORC; otherwise Parquet would suit you better. Compression of flat data works great in ORC.

We did some benchmarking with a large flattened file: we converted it to a Spark DataFrame, saved it in both Parquet and ORC format on S3, and queried it with Redshift Spectrum (the setup is sketched after the numbers below).

  • Size of the file in Parquet: ~7.5 GB, took 7 minutes to write
  • Size of the file in ORC: ~7.1 GB, took 6 minutes to write
  • Queries seemed faster on the ORC files.
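Roughly, the setup looked like this (bucket, schema and column names are changed, and the exact DDL depends a bit on the Spark and Redshift versions):

    -- Spark SQL: write the flattened data out in both formats on S3
    CREATE TABLE bench_parquet USING parquet
    LOCATION 's3a://my-bucket/bench/parquet/'
    AS SELECT * FROM source_flat_table;

    CREATE TABLE bench_orc USING orc
    LOCATION 's3a://my-bucket/bench/orc/'
    AS SELECT * FROM source_flat_table;

    -- Redshift Spectrum: an external table pointed at the same prefix
    CREATE EXTERNAL TABLE spectrum.bench_parquet (id BIGINT, amount DOUBLE, category VARCHAR(64))
    STORED AS PARQUET
    LOCATION 's3://my-bucket/bench/parquet/';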

We will soon run a similar comparison for nested data and update the results here.

0
Jan 23 '19 at 20:47


