Generating unique identifiers in Hive

I am trying to create unique identifiers for each row of a table (30+ million rows).

  • Using sequential numbers obviously doesn't work due to the parallel nature of Hadoop.
  • The built-in UDFs rand() and hash(rand(), unixtime()) seem to produce collisions.

There should be an easy way to generate string IDs, and I was wondering if someone has a solution.

  • My next step is simply to write a Java MapReduce job that creates a real hash string, seeded with SecureRandom + host IP + current time, but I figured I would ask here before doing that ;)
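The seeding idea in the last bullet can be sketched in plain Java. This is a hypothetical illustration of the asker's plan (class and method names are mine, not from the post): mix the host address and the current time into a SecureRandom seed, then emit hex strings. Note that SecureRandom is already well seeded by default, so explicit seeding like this is mostly about making the per-host intent visible.

```java
import java.net.InetAddress;
import java.security.SecureRandom;

public class RowIdGenerator {
    private final SecureRandom rng;

    public RowIdGenerator() {
        byte[] host;
        try {
            // Mix in this host's address so mappers on different nodes diverge.
            host = InetAddress.getLocalHost().getAddress();
        } catch (Exception e) {
            host = new byte[] {127, 0, 0, 1}; // fallback if the lookup fails
        }
        // Seed = host address bytes followed by the current nanosecond clock.
        byte[] seed = new byte[host.length + Long.BYTES];
        System.arraycopy(host, 0, seed, 0, host.length);
        long now = System.nanoTime();
        for (int i = 0; i < Long.BYTES; i++) {
            seed[host.length + i] = (byte) (now >>> (8 * i));
        }
        rng = new SecureRandom(seed);
    }

    // Returns a 32-character lowercase hex string (128 random bits).
    public String nextId() {
        byte[] buf = new byte[16];
        rng.nextBytes(buf);
        StringBuilder sb = new StringBuilder(32);
        for (byte b : buf) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```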
8 answers

Use the reflect() UDF to generate UUIDs:

reflect("java.util.UUID", "randomUUID") 

Not sure if any of this is useful, but here goes...

Consider the native MapReduce analogue: if your input set is text-based, the Mapper key (and therefore a unique identifier) will be the file name plus the byte offset of each line within it.

When you load data into Hive, if you can create an extra "column" that carries this information, you get your row ID for free. It is semantically meaningless, but it is essentially the approach you mentioned above.


To follow up on jtravaglini's answer: as of 0.8.0, there are two built-in virtual Hive columns that can be used to generate a unique identifier:

 INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE 

Use like this:

 select concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE) as rowkey, ... ;
 ...
 OK
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:0
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:57
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:114
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:171
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:228
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:285
 hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:342
 ...

Or you can anonymize it with md5 or similar; here is a link to an md5 UDF: https://gist.github.com/dataminelab/1050002
(note that the function's class name is capitalized: 'Md5')

 select Md5(concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE)) as rowkey, ... 

reflect("java.util.UUID", "randomUUID")

I could not vote up the other answer. I needed a pure binary version, so I used this:

unhex(regexp_replace(reflect('java.util.UUID', 'randomUUID'), '-', ''))


Depending on the nature of your jobs and how often you plan to run them, using sequential numbers may be a reasonable alternative. You can implement a rank() UDF as described in this other SO question.


Write a custom Mapper that keeps a counter per Map task and emits the JobID (obtained from the MR API) concatenated with the counter's current value as the row identifier for each line. Increment the counter before reading the next line.
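A minimal sketch of the ID-composition logic behind this answer, with the Hadoop plumbing stripped out (class and method names are hypothetical; a real Mapper would take the task attempt ID from its Context rather than a constructor argument):

```java
public class TaskRowIds {
    private final String taskId;
    private long counter = 0;

    public TaskRowIds(String taskId) {
        this.taskId = taskId;
    }

    // Each call yields "taskId_counter": unique across tasks because
    // task attempt IDs are unique, and unique within a task because
    // the counter increments on every call.
    public String nextRowId() {
        return taskId + "_" + (counter++);
    }
}
```

Uniqueness here rests entirely on the framework never reusing a task attempt ID, which is why the JobID/task ID must come from the MR API rather than being invented locally.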


If you are working with multiple mappers and a large dataset, try this UDF: https://github.com/manojkumarvohra/hive-hilo

It uses ZooKeeper as a central repository to maintain sequence state and generates unique, incrementing numeric values.


Use the ROW_NUMBER function to generate monotonically increasing integer identifiers:

 select ROW_NUMBER() OVER () AS id from t1; 

See https://community.hortonworks.com/questions/58405/how-to-get-the-row-number-for-particular-values-fr.html.


Source: https://habr.com/ru/post/951825/
