Hive offers several options for storing data. You can define an external table, where Hive simply reads data that lives somewhere else, or you can create a managed table whose data sits in Hive's own storage from the start. Input and output formats let you describe the physical storage of the data for both kinds of table. You still work with the table through SQL, but at a low level it can be a text file, a sequence file, an HBase table, or some other data structure.
InputFormat and OutputFormat - describe the underlying data layout so that Hive can read it into a table and write it back correctly
SerDe - the class that actually translates data between the table's row representation and the low-level structures used by the input and output formats
Typically, the flow looks like this: HDFS files → InputFileFormat → Deserializer → Row object → Serializer → OutputFileFormat → HDFS files
STORED AS - specifies the storage format of a new Hive table, which bundles the input format and the output format
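As a rough sketch of how these pieces appear in a table definition, the HiveQL below creates one managed table using the STORED AS shorthand and one external table that spells out the SerDe, input format, and output format explicitly. The table names, columns, delimiter, and HDFS path are made up for illustration; the class names are the ones Hive normally uses for plain text files.

```sql
-- Managed table: the data lives in Hive's own warehouse directory.
-- STORED AS ORC is shorthand that selects the ORC SerDe together with
-- the matching input and output formats.
CREATE TABLE page_views_orc (
  user_id BIGINT,
  url     STRING
)
STORED AS ORC;

-- External table: Hive only reads the files already present at LOCATION
-- (a hypothetical path). The SerDe and both formats are named explicitly.
CREATE EXTERNAL TABLE page_views_raw (
  user_id BIGINT,
  url     STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/raw/page_views';

-- Copying rows between the two tables exercises the whole chain: rows are
-- read through the text input format and deserializer, then written through
-- the ORC serializer and output format.
INSERT INTO TABLE page_views_orc
SELECT user_id, url FROM page_views_raw;
```

Dropping the external table removes only its metadata and leaves the files in place, while dropping the managed table also deletes its data.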
These attributes can significantly affect performance, storage size, and support for schema evolution, and some formats enable features such as ACID. The following article walks through steps that show how all of this works at a low level and gives an overview of the most commonly used formats: https://oyermolenko.blog/2017/02/16/structuring-hadoop-data-through-hive-and-sql
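One quick way to check which formats a table actually uses is to ask Hive for its metadata. This is a minimal sketch; page_views_orc is a hypothetical table name standing in for any existing table.

```sql
-- The storage section of the output lists the SerDe library,
-- the InputFormat, and the OutputFormat of the table.
DESCRIBE FORMATTED page_views_orc;

-- Prints the full DDL, including the ROW FORMAT and STORED AS clauses.
SHOW CREATE TABLE page_views_orc;
```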