What is the difference between "INPUTFORMAT, OUTPUTFORMAT" and "STORED AS" in Hive?

I'm new to big data and am currently studying Hive. I understand the concept of InputFormat and OutputFormat in Hive as part of a SerDe. I also learned that "STORED AS" is used to store a file in a specific format, much like InputFormat. But I do not understand what the significant difference is between using "INPUTFORMAT, OUTPUTFORMAT" and "STORED AS".

Any help is appreciated.

1 answer

Hive has many options for storing data. You can use an external table, where Hive simply wraps data that lives somewhere else, or you can create a standalone (managed) table from scratch in Hive's own warehouse. Input and output formats let you specify the underlying data structure of both kinds of table, that is, how the data is physically stored. On your side you keep working with the table through SQL, but at a low level it will be a text file, a sequence file, an HBase table, or some other data structure.
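As a rough illustration of the two kinds of table, the sketch below declares one of each. The table names, columns, and the LOCATION path are hypothetical and exist only for this example; the clauses themselves are standard HiveQL.

```sql
-- Hypothetical names and path, for illustration only.

-- External table: Hive reads files that already exist under LOCATION.
-- Dropping the table removes only the metadata, not the files.
CREATE EXTERNAL TABLE logs_ext (
  ts  STRING,
  msg STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/logs';

-- Managed table: Hive creates and owns the files in its warehouse directory.
CREATE TABLE logs_orc (
  ts  STRING,
  msg STRING
)
STORED AS ORC;

-- Both tables are queried with the same SQL, whatever the physical format.
SELECT msg FROM logs_ext WHERE ts >= '2017-01-01';
```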

InputFormat and OutputFormat - let you describe the original data structure so that Hive can map it correctly to a table view.

SerDe - a class that does the actual translation of data from the table view to the low-level input/output format structures and back.

Typically, your process will look like this: HDFS files → InputFileFormat → Deserializer → Row object → Serializer → OutputFileFormat → HDFS files
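To show where each stage of that chain is named in a table definition, here is a minimal sketch for a plain text table with every piece spelled out. The table and column names are hypothetical; the three class names are the ones Hive uses for its default text format.

```sql
-- Hypothetical table; the classes are Hive's defaults for text storage.
CREATE TABLE people_txt (
  name STRING,
  city STRING
)
-- SerDe: turns raw records into row objects on read, and back on write.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS
  -- InputFormat: how records are read from the files in HDFS.
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  -- OutputFormat: how records are written back to files in HDFS.
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```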

STORED AS - specifies the storage format, which bundles both the input and the output format, for new tables in Hive.

These attributes can significantly affect performance, overall data size, and support for schema evolution, or enable features such as ACID. You can follow the steps described in this article to see how everything works at a low level and to get an overview of the most commonly used formats: https://oyermolenko.blog/2017/02/16/structuring-hadoop-data-through-hive-and-sql
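In HiveQL this clause is written STORED AS, and the difference from the question can be sketched as shorthand versus long form. The two hypothetical tables below should end up with the same storage settings: the first uses a format keyword, the second names the input and output format classes explicitly.

```sql
-- Hypothetical table names, for illustration only.

-- Shorthand: a format keyword that implies both classes.
CREATE TABLE events_short (id INT, name STRING)
STORED AS SEQUENCEFILE;

-- Long form: the same two classes written out by hand.
CREATE TABLE events_long (id INT, name STRING)
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat';

-- Lists the InputFormat, OutputFormat and SerDe library of a table,
-- which lets you compare the two definitions.
DESCRIBE FORMATTED events_short;
DESCRIBE FORMATTED events_long;
```

The long form is mainly useful when no keyword exists for the classes you need, for example a custom InputFormat.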


Source: https://habr.com/ru/post/1268684/

