I ran several tests on the storage formats available with Hive, using Parquet and ORC as the main candidates. I created the ORC table once with default compression and once with Snappy.
I have read many documents that say Parquet is better than ORC in time and space, but my tests show the opposite of the documents I went through.
Here are some details of my data.
Table A - Text file format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB
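For reference, the compressed tables were built roughly like this (a minimal sketch: the table names match the letters above, the data comes from the text table via CTAS, and orc.compress is the ORC table property for picking the codec):

    -- Sketch: build each format from the text table (table_a) via CTAS.
    CREATE TABLE table_b STORED AS ORC
    AS SELECT * FROM table_a;

    -- Same, but with Snappy instead of ORC's default codec.
    CREATE TABLE table_c STORED AS ORC
    TBLPROPERTIES ("orc.compress"="SNAPPY")
    AS SELECT * FROM table_a;

    CREATE TABLE table_d STORED AS PARQUET
    AS SELECT * FROM table_a;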
Parquet gave the worst compression of the three compressed formats for my table.
My tests with the above tables gave the following results.
Row count operation
Text Format Cumulative CPU - 123.33 sec
Parquet Format Cumulative CPU - 204.92 sec
ORC Format Cumulative CPU - 119.99 sec
ORC with SNAPPY Cumulative CPU - 107.05 sec
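The row count was a plain COUNT(*) of this shape, run once per table (table name is the only thing that varies):

    SELECT COUNT(*) FROM table_a;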
Column sum operation
Text Format Cumulative CPU - 127.85 sec
Parquet Format Cumulative CPU - 255.2 sec
ORC Format Cumulative CPU - 120.48 sec
ORC with SNAPPY Cumulative CPU - 98.27 sec
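The sum was over a single column, along these lines (some_column is a placeholder):

    SELECT SUM(some_column) FROM table_a;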
Column average operation
Text Format Cumulative CPU - 128.79 sec
Parquet Format Cumulative CPU - 211.73 sec
ORC Format Cumulative CPU - 165.5 sec
ORC with SNAPPY Cumulative CPU - 135.45 sec
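Likewise for the average (same placeholder column):

    SELECT AVG(some_column) FROM table_a;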
Selecting 4 columns within a given range using a WHERE clause
Text Format Cumulative CPU - 72.48 sec
Parquet Format Cumulative CPU - 136.4 sec
ORC Format Cumulative CPU - 96.63 sec
ORC with SNAPPY Cumulative CPU - 82.05 sec
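That query is of this shape (column names and range bounds are placeholders):

    SELECT col1, col2, col3, col4
    FROM table_a
    WHERE some_column BETWEEN 1000 AND 2000;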
Does this mean that ORC is faster than Parquet? Or is there something I can do to improve the response time and compression ratio I get with Parquet?
Thanks!
hadoop hive parquet snappy orc
Rahul Sep 03 '15 at 10:45