The difference between "STORED AS INPUTFORMAT, OUTPUTFORMAT" and "STORED AS" in Hive

The problem arises when running show create table on an ORC table and then running the resulting create table statement.

Using show create table, you get the following:

 STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

But if you create a table using this clause, you get a ClassCastException on SELECT:

Failed with exception java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.BinaryComparable


To fix this, simply change the create table statement to use STORED AS ORC, as sketched below.
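For concreteness, a minimal sketch of the failing and working DDL (the table names are illustrative):

 -- Reproduces the error on SELECT
 CREATE TABLE orc_broken (i int)
 STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

 -- Works
 CREATE TABLE orc_fixed (i int)
 STORED AS ORC;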

But even after reading the answer to a similar question, What is the difference between 'InputFormat, OutputFormat' and 'Stored as' in Hive?, I still cannot understand the reason.

+7
2 answers

STORED AS implies three things:

  • SERDE
  • INPUTFORMAT
  • OUTPUTFORMAT

You only defined the last two, leaving the SERDE to be determined by hive.default.serde:

hive.default.serde
Default Value: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Added In: Hive 0.14 with HIVE-5976
The default SerDe Hive will use for storage formats that do not specify a SerDe.
Storage formats that currently do not specify a SerDe include "TextFile, RcFile".
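In other words, a sketch of the explicit equivalent (the table name is illustrative): stating the ORC SerDe yourself makes the long form behave the same as STORED AS ORC:

 CREATE TABLE mytable3 (i int)
 ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
 STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';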

Demo

hive.default.serde

 set hive.default.serde; 

 hive.default.serde=org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe 

STORED AS ORC

 create table mytable (i int) stored as orc;
 show create table mytable;

Note that SERDE has the value 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'

 CREATE TABLE `mytable`(
   `i` int)
 ROW FORMAT SERDE
   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
 STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
 LOCATION
   'file:/home/cloudera/local_db/mytable'
 TBLPROPERTIES (
   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
   'numFiles'='0',
   'numRows'='0',
   'rawDataSize'='0',
   'totalSize'='0',
   'transient_lastDdlTime'='1496982059')

STORED AS INPUTFORMAT ... OUTPUTFORMAT ...

 create table mytable2 (i int)
 STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

 show create table mytable2;

Note that SERDE has the value 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

 CREATE TABLE `mytable2`(
   `i` int)
 ROW FORMAT SERDE
   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
 STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
 LOCATION
   'file:/home/cloudera/local_db/mytable2'
 TBLPROPERTIES (
   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
   'numFiles'='0',
   'numRows'='0',
   'rawDataSize'='0',
   'totalSize'='0',
   'transient_lastDdlTime'='1496982426')
+8

You can specify INPUTFORMAT, OUTPUTFORMAT, and SERDE in the CREATE TABLE statement. Hive lets you separate your record format from your file format, and you can provide custom classes for each of INPUTFORMAT, OUTPUTFORMAT, and SERDE. More details: http://www.dummies.com/programming/big-data/hadoop/defining-table-record-formats-in-hive/
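A short sketch of that separation (assuming the OpenCSVSerde class bundled with Hive 0.14+; table and column names are illustrative), pairing a custom record format with the plain text file format:

 -- OpenCSVSerde parses the records; TEXTFILE is the file format.
 -- Note: OpenCSVSerde treats every column as string.
 CREATE TABLE csv_table (name string, age string)
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
 STORED AS TEXTFILE;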

Alternatively, you can simply write STORED AS ORC or STORED AS TEXTFILE, for example. STORED AS ORC already takes care of INPUTFORMAT, OUTPUTFORMAT, and SERDE, which saves you from writing out those long fully qualified Java class names. Just STORED AS ORC.

+3
