Is there a place to store data schemas in Hadoop?

I have recently been experimenting with Hadoop, Hive, and Pig for data transformation. Along the way I noticed that the schema of a data file is not attached to the file at all: data files are just flat files (unless you use something like a SequenceFile), and every application that wants to work with them has its own way of describing their layout.

For example, suppose I upload a file to HDFS and want to transform it with Pig. To work with it effectively, I have to specify the file's schema when loading the data:

    EMP = LOAD 'myfile' USING PigStorage() AS (first_name: chararray, last_name: chararray, deptno: int);

Now, I know that when storing a file with PigStorage the schema can be written out alongside it, but to get the file into Pig in the first place it seems you still have to spell the schema out.
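For reference, this is the mechanism I mean — a rough sketch, assuming a Pig version whose PigStorage supports the '-schema' option (it writes a .pig_schema file next to the data); the emp_out path is just illustrative:

    -- store the relation together with its schema (.pig_schema in the output directory)
    STORE EMP INTO 'emp_out' USING PigStorage('\t', '-schema');

    -- a later load from the same path can then recover the schema automatically
    EMP2 = LOAD 'emp_out' USING PigStorage('\t', '-schema');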

If I want to work with the same file in Hive, I need to create a table and specify the schema too:

    CREATE EXTERNAL TABLE EMP (first_name STRING, last_name STRING, empno INT) LOCATION 'myfile';

This seems very fragile to me: if the file format changes even slightly, the schema has to be updated manually in every application. I'm sure I'm being naive, but wouldn't it make sense to store the schema with the data file? That way data could move between applications, and the barrier to trying another tool would be lower, since you wouldn't need to re-encode the schema for each one.

So the question is: is there a way to attach a schema to a data file in Hadoop/HDFS, or do I have to specify the schema separately in every application?

+6
3 answers

It sounds like you're looking for Apache Avro. With Avro, the schema is embedded in the data itself, so you can read a file without worrying about schema mismatches, and it greatly simplifies schema evolution.

The great thing about Avro is that it integrates fully with Hadoop and with many of its subprojects, such as Pig and Hive.

For example, with Pig you can do:

    EMP = LOAD 'myfile.avro' USING AvroStorage();
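Because the schema is stored in the file's metadata, Pig can report it straight back to you — a quick check, assuming the LOAD above succeeded:

    -- prints the schema AvroStorage read from the file; no AS clause was needed
    DESCRIBE EMP;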

I'd advise reading the AvroStorage documentation for more detail.

You can also use Avro with Hive, as described here; I haven't used that combination personally, but it should work much the same way.
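For completeness, here is roughly what an Avro-backed Hive table looks like — a sketch only, assuming the Avro SerDe is available in your Hive build; the LOCATION and avro.schema.url paths are hypothetical:

    -- No column list is needed: Hive derives the columns from the Avro schema.
    CREATE EXTERNAL TABLE emp_avro
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '/data/emp_avro'
    TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/emp.avsc');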

+3

What you need is HCatalog, which describes itself as follows:

"Apache HCatalog is a table and data warehouse management service created using Apache Hadoop.

It includes:

  • Providing a mechanism for the overall schema and data type.
  • Providing table abstraction so that users are not interested in where and how their data is stored.
  • Providing interoperability between data processing tools like Pig, Map Reduce and Hive. "

You can take a look at the "data flow example" in the docs to see exactly the scenario you are talking about.
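As a rough sketch of what this buys you in Pig (assuming an HCatalog-managed table named emp already exists, and using the old org.apache.hcatalog package name):

    -- HCatLoader pulls the schema from the HCatalog metastore,
    -- so no AS clause is needed in the script
    EMP = LOAD 'emp' USING org.apache.hcatalog.pig.HCatLoader();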

+1

Apache Zebra seems to be a tool that can provide a common schema definition across MapReduce, Pig, and Hive. It has its own schema store, and MapReduce jobs can use its built-in TableStore to write to HDFS.
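I haven't tried it myself, but going by the Zebra docs the Pig side looks roughly like this — the table paths are illustrative, and org.apache.hadoop.zebra.pig.TableLoader / TableStorer are assumed to be on the classpath:

    -- Zebra keeps the schema with the table, so no AS clause is needed
    A = LOAD 'mytable' USING org.apache.hadoop.zebra.pig.TableLoader();
    STORE A INTO 'newtable' USING org.apache.hadoop.zebra.pig.TableStorer('');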

0
