Generating unique identifiers in Hive

I am trying to create unique identifiers for each row of a table (30+ million rows).

  • Using sequential numbers obviously doesn't work due to the parallel nature of Hadoop.
  • The built-in UDFs rand() and hash(rand(), unixtime()) seem to produce collisions.

There should be an easy way to generate string IDs, and I was wondering if someone has a solution.

  • My next step is simply to write a Java MapReduce job that creates a real hash string, seeded with SecureRandom + host IP + current time, but I figured I would ask here before doing that ;)
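The seeding idea in the last bullet can be sketched in plain Java. This is a hypothetical illustration of the asker's plan (class and method names are mine, not from the post): mix the host address and the current time into a SecureRandom seed, then emit hex strings. Note that SecureRandom is already well seeded by default, so explicit seeding like this is mostly about making the per-host intent visible.

```java
import java.net.InetAddress;
import java.security.SecureRandom;

public class RowIdGenerator {
    private final SecureRandom rng;

    public RowIdGenerator() {
        byte[] host;
        try {
            // Mix in this host's address so mappers on different nodes diverge.
            host = InetAddress.getLocalHost().getAddress();
        } catch (Exception e) {
            host = new byte[] {127, 0, 0, 1}; // fallback if the lookup fails
        }
        // Seed = host address bytes followed by the current nanosecond clock.
        byte[] seed = new byte[host.length + Long.BYTES];
        System.arraycopy(host, 0, seed, 0, host.length);
        long now = System.nanoTime();
        for (int i = 0; i < Long.BYTES; i++) {
            seed[host.length + i] = (byte) (now >>> (8 * i));
        }
        rng = new SecureRandom(seed);
    }

    // Returns a 32-character lowercase hex string (128 random bits).
    public String nextId() {
        byte[] buf = new byte[16];
        rng.nextBytes(buf);
        StringBuilder sb = new StringBuilder(32);
        for (byte b : buf) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```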
8 answers

Use the reflect() UDF to generate UUIDs:

reflect("java.util.UUID", "randomUUID") 

Not sure if any of this is useful, but here goes...

Consider the native MapReduce analogue: if your input set is text-based, the Mapper key (and therefore a unique identifier) will be the file name plus the byte offset of each line within it.

When you load data into Hive, if you can create an extra "column" that carries this information, you get your row ID for free. It is semantically meaningless, but it is essentially the approach you mentioned above.


To follow up on jtravaglini's answer: as of 0.8.0, there are two built-in virtual Hive columns that can be used to generate a unique identifier:

 INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE 

Use like this:

 select concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE) as rowkey, ... ;
 ...
 OK
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:0
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:57
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:114
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:171
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:228
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:285
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:342
 ...

Or you can anonymize it with md5 or similar; here is a link to an md5 UDF: https://gist.github.com/dataminelab/1050002
(note that the function's class name is capitalized: 'Md5')

 select Md5(concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE)) as rowkey, ... 

reflect("java.util.UUID", "randomUUID")

I could not vote up the other answer. I needed a pure binary version, so I used this:

unhex(regexp_replace(reflect('java.util.UUID', 'randomUUID'), '-', ''))


Depending on the nature of your jobs and how often you plan to run them, using sequential numbers may be a reasonable alternative. You can implement a rank() UDF as described in this other SO question.


Write a custom Mapper that keeps a counter per Map task and emits the JobID (obtained from the MR API) concatenated with the counter's current value as the row identifier for each line. Increment the counter before reading the next line.
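A minimal sketch of the ID-composition logic behind this answer, with the Hadoop plumbing stripped out (class and method names are hypothetical; a real Mapper would take the task attempt ID from its Context rather than a constructor argument):

```java
public class TaskRowIds {
    private final String taskId;
    private long counter = 0;

    public TaskRowIds(String taskId) {
        this.taskId = taskId;
    }

    // Each call yields "taskId_counter": unique across tasks because
    // task attempt IDs are unique, and unique within a task because
    // the counter increments on every call.
    public String nextRowId() {
        return taskId + "_" + (counter++);
    }
}
```

Uniqueness here rests entirely on the framework never reusing a task attempt ID, which is why the JobID/task ID must come from the MR API rather than being invented locally.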


If you are working with multiple mappers and a large dataset, try this UDF: https://github.com/manojkumarvohra/hive-hilo

It uses ZooKeeper as a central repository to maintain sequence state and generates unique, incrementing numeric values.


Use the ROW_NUMBER function to generate monotonically increasing integer identifiers:

 select ROW_NUMBER() OVER () AS id from t1; 

See https://community.hortonworks.com/questions/58405/how-to-get-the-row-number-for-particular-values-fr.html.


Source: https://habr.com/ru/post/951825/
