Recently I have been doing some research with Hadoop, Hive and Pig for transforming data. In the process I noticed that a data file's schema is not attached to the file at all. Data files are just flat files (unless you use something like a SequenceFile), and each application that wants to work with them has its own way of describing their layout.
For example, say I upload a file to HDFS and want to transform it with Pig. To work with it effectively, I need to specify the file's schema when loading the data:
EMP = LOAD 'myfile' USING PigStorage() AS (first_name: chararray, last_name: chararray, deptno: int);
Now, I know that when storing a file with PigStorage the schema can be written out alongside it, but to get the file into Pig in the first place it seems you still need to spell out the schema.
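If I understand that option correctly, it looks something like the sketch below (the output path 'myfile_out' and the tab delimiter are just placeholders; my understanding is that newer Pig versions write a .pig_schema file next to the data when PigStorage is given '-schema'):

-- Sketch only: as I understand it, '-schema' makes PigStorage write a .pig_schema file alongside the data.
STORE EMP INTO 'myfile_out' USING PigStorage('\t', '-schema');

-- A later load should then pick the schema up automatically, with no AS clause needed:
EMP2 = LOAD 'myfile_out' USING PigStorage('\t');
DESCRIBE EMP2;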
If I want to work with the same file in Hive, I need to create a table and specify the schema too:
CREATE EXTERNAL TABLE EMP (first_name STRING, last_name STRING, empno INT) LOCATION 'myfile';
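(Strictly speaking, for a plain delimited text file I believe I would also have to spell out the row format, which is yet more per-application detail to keep in sync. The tab delimiter and the directory path '/user/me/emp' below are just assumptions about how my file is laid out:)

CREATE EXTERNAL TABLE EMP (first_name STRING, last_name STRING, empno INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/me/emp';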
This seems very fragile to me. If the file format changes even slightly, the schema has to be updated manually in every application. I'm sure I'm being naive, but wouldn't it make sense to store the schema with the data file? That way the schema would travel with the data, and the barrier to using another tool would be lower, since you wouldn't have to redefine the schema for each application.
So the question is: is there a way to attach a schema to a data file in Hadoop/HDFS, or do I have to specify the schema for the data file in each application?