How can I efficiently store data in Hive, and also store and retrieve compressed data in Hive? I am currently storing it as a TextFile. I went through an article by Bejoy and found that LZO compression would be good for storing the files, and that it is also splittable.
I have one HiveQL SELECT query that generates some output, and I store this output so that one of my Hive tables (quality) can use the data and I can then query that quality table.
Below is the quality table. I load it from the SELECT query below, creating a partition that I use to overwrite the quality table.
create table quality (
  id bigint,
  total bigint,
  error bigint
)
partitioned by (ds string)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/uname/quality';

insert overwrite table quality partition (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1
FROM Table1;
Currently I store it as a TextFile. Should I switch to a SequenceFile and start storing the data in LZO compression format, or is a text file fine here too? The SELECT query produces several GB of data, which needs to be loaded into the quality table daily.
So which way is better? Should I store the output in TextFile format or in SequenceFile format (with LZO compression), so that I get results faster when I query the Hive quality table? In other words, which makes querying faster?
Update:
What if I store it as a SequenceFile with block compression, like below?
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
Do I need to set any other properties to enable BLOCK compression, apart from the ones above? I am also creating the table in SequenceFile format.
Update again:
Do I need to create a table as shown below? Or do some other changes need to be made to enable BLOCK compression with a sequence file?
create table lipy (
  buyer_id bigint,
  total_chkout bigint,
  total_errpds bigint
)
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as sequencefile
location '/apps/hdmi-technology/lipy';
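To make the question concrete, here is a rough sketch of the full sequence I have in mind for loading the lipy table. It rests on several assumptions that I have not confirmed: that hive.exec.compress.output is the Hive property that turns on compression for query output, that the LZO codec comes from a separately installed hadoop-lzo library under the class name com.hadoop.compression.lzo.LzoCodec (this differs from the class name in my settings above, and which one applies likely depends on the installation), and that the source table checkout_source and the dt value are placeholders I made up for illustration.

```sql
-- Assumption: this Hive property enables compression of the data written by the query.
set hive.exec.compress.output=true;
set mapred.output.compress=true;
-- BLOCK compresses batches of records together inside the SequenceFile.
set mapred.output.compression.type=BLOCK;
-- Assumption: hadoop-lzo is installed on the cluster and provides this codec class.
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

-- checkout_source is a hypothetical source table; the dt value is a placeholder.
insert overwrite table lipy partition (dt='20120709')
select buyer_id, total_chkout, total_errpds
from checkout_source;
```

If this is roughly right, the table definition itself would only need the stored as sequencefile clause, and the compression would be controlled entirely by the session settings at insert time.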