Efficient data storage in Hive

How can I efficiently store data in Hive, and how can I store and retrieve compressed data in Hive? I am currently storing it as a TextFile. I went through a Bejoy article and found that LZO compression would be good for storing the files, and that it is also splittable.

I have one HiveQL SELECT query that generates some output, and I store this output somewhere so that one of my Hive tables (quality) can use the data, and I can then query that quality table.

Below is the quality table, into which I load the data from the SELECT query, using INSERT OVERWRITE on a partition of the quality table.

 create table quality (
   id bigint,
   total bigint,
   error bigint
 )
 partitioned by (ds string)
 row format delimited fields terminated by '\t'
 stored as textfile
 location '/user/uname/quality';

 insert overwrite table quality partition (ds='20120709')
 SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1
 FROM Table1;

Right now I save it as a TextFile. Should I make it a SequenceFile and start storing the data in LZO compression format, or will a text file also be fine here? The SELECT query produces several GB of data, which needs to be loaded into the quality table daily.

So which way is better? Should I store the output in TextFile or SequenceFile format (with LZO compression), so that when I query the Hive quality table, I get results faster, i.e. querying is faster?
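If the SequenceFile route is chosen, only the STORED AS clause of the table DDL needs to change; the file format of a Hive table is declared per table. A minimal sketch (the table name quality_seq and its location are illustrative assumptions, not from the original setup):

```sql
-- Sketch: the same quality table declared as a SequenceFile instead of a TextFile.
-- Only the STORED AS clause differs from the TextFile version.
create table quality_seq (
  id bigint,
  total bigint,
  error bigint
)
partitioned by (ds string)
row format delimited fields terminated by '\t'
stored as sequencefile
location '/user/uname/quality_seq';
```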

Update:

What if I store it as a SequenceFile with BLOCK compression, as below?

 set mapred.output.compress=true;
 set mapred.output.compression.type=BLOCK;
 set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;

Apart from the settings above, do I need to install anything else to enable BLOCK compression? Also, do I need to create the table in SequenceFile format?
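For reference, a sketch of the session settings that are commonly combined for compressed query output in Hive; hive.exec.compress.output is the Hive-level switch that the mapred.* properties above do not cover. Note that the LZO codec itself is not bundled with Hadoop or Hive (for licensing reasons), so the hadoop-lzo native libraries and jar have to be installed separately on the cluster:

```sql
-- Sketch: session settings for block-compressed output, assuming
-- hadoop-lzo is already installed on every node.
SET hive.exec.compress.output=true;   -- tell Hive to compress final query output
SET mapred.output.compress=true;
SET mapred.output.compression.type=BLOCK;   -- BLOCK rather than per-record compression
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
```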

Update 2:

Do I need to create the table as shown below, or are some other changes needed to enable BLOCK compression with a SequenceFile?

 create table lipy (
   buyer_id bigint,
   total_chkout bigint,
   total_errpds bigint
 )
 partitioned by (dt string)
 row format delimited fields terminated by '\t'
 stored as sequencefile
 location '/apps/hdmi-technology/lipy';
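On the DDL side, stored as sequencefile is sufficient; BLOCK compression is not declared in the table definition but controlled by session settings at write time. A sketch of loading the table above with compression enabled (the source table and the partition value are hypothetical, for illustration only):

```sql
-- Sketch: writing block-compressed data into the SequenceFile table above.
-- The DDL does not change; the SET statements control how the data is written.
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;

-- some_source_table is a hypothetical source, standing in for the real query.
INSERT OVERWRITE TABLE lipy PARTITION (dt='20120709')
SELECT buyer_id, total_chkout, total_errpds
FROM some_source_table;
```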
1 answer

I haven't used Hive much, but from experience with Hadoop and structured data, I got the best performance from SequenceFiles with BLOCK compression. Row (RECORD) compression is the default, but it is not as efficient as BLOCK compression for storing structured data whose individual records are not particularly large. To enable it, I used mapred.output.compression.type=BLOCK.


Source: https://habr.com/ru/post/921883/
