In response to the jtravaglini question, starting from 0.8.0, there are 2 built-in virtual Hive columns that can be used to generate a unique identifier:
INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
Use like this:
select concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE) as rowkey, ... ; ... OK hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:0 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:57 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:114 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:171 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:228 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:285 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:342 ...
Or you can anonymize it with md5 or the like, here is a link to md5 UDF: https://gist.github.com/dataminelab/1050002
(note that the class name of the function is initcap 'Md5')
select Md5(concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE)) as rowkey, ...
source share