The difference between "STORED AS INPUTFORMAT, OUTPUTFORMAT" and "STORED AS" in Hive

The problem arises when running show create table on an ORC table and then running the resulting create table statement.

Using show create table, you get the following:

 STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

But if you create a table using this clause, you get a ClassCastException on SELECT:

Failed with exception java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.BinaryComparable


To fix this, simply change the create table statement to use STORED AS ORC, as sketched below.
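For concreteness, a minimal sketch of the failing and working DDL (the table names are illustrative):

 -- Reproduces the error on SELECT
 CREATE TABLE orc_broken (i int)
 STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

 -- Works
 CREATE TABLE orc_fixed (i int)
 STORED AS ORC;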

But even after reading the answer to a similar question, What is the difference between 'InputFormat, OutputFormat' and 'Stored as' in Hive?, I still cannot understand the reason.

+7
2 answers

STORED AS implies three things:

  • SERDE
  • INPUTFORMAT
  • OUTPUTFORMAT

You only defined the last two, leaving the SERDE to be determined by hive.default.serde:

hive.default.serde
Default Value: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Added In: Hive 0.14 with HIVE-5976
The default SerDe Hive will use for storage formats that do not specify a SerDe.
Storage formats that currently do not specify a SerDe include "TextFile, RcFile".
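In other words, a sketch of the explicit equivalent (the table name is illustrative): stating the ORC SerDe yourself makes the long form behave the same as STORED AS ORC:

 CREATE TABLE mytable3 (i int)
 ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
 STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';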

Demo

hive.default.serde

 set hive.default.serde; 

 hive.default.serde=org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe 

STORED AS ORC

 create table mytable (i int) stored as orc;
 show create table mytable;

Note that SERDE has the value 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'

 CREATE TABLE `mytable`(
   `i` int)
 ROW FORMAT SERDE
   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
 STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
 LOCATION
   'file:/home/cloudera/local_db/mytable'
 TBLPROPERTIES (
   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
   'numFiles'='0',
   'numRows'='0',
   'rawDataSize'='0',
   'totalSize'='0',
   'transient_lastDdlTime'='1496982059')

STORED AS INPUTFORMAT ... OUTPUTFORMAT ...

 create table mytable2 (i int)
 STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

 show create table mytable2;

Note that SERDE has the value 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

 CREATE TABLE `mytable2`(
   `i` int)
 ROW FORMAT SERDE
   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
 STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
 LOCATION
   'file:/home/cloudera/local_db/mytable2'
 TBLPROPERTIES (
   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
   'numFiles'='0',
   'numRows'='0',
   'rawDataSize'='0',
   'totalSize'='0',
   'transient_lastDdlTime'='1496982426')
+8

You can specify INPUTFORMAT, OUTPUTFORMAT, and SERDE in the CREATE TABLE statement. Hive lets you separate your record format from your file format, and you can provide custom classes for each of INPUTFORMAT, OUTPUTFORMAT, and SERDE. More details: http://www.dummies.com/programming/big-data/hadoop/defining-table-record-formats-in-hive/
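A short sketch of that separation (assuming the OpenCSVSerde class bundled with Hive 0.14+; table and column names are illustrative), pairing a custom record format with the plain text file format:

 -- OpenCSVSerde parses the records; TEXTFILE is the file format.
 -- Note: OpenCSVSerde treats every column as string.
 CREATE TABLE csv_table (name string, age string)
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
 STORED AS TEXTFILE;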

Alternatively, you can simply write STORED AS ORC or STORED AS TEXTFILE, for example. STORED AS ORC already takes care of INPUTFORMAT, OUTPUTFORMAT, and SERDE, which saves you from writing out those long fully qualified Java class names. Just STORED AS ORC.

+3
