How can I generate unique key values for records in a BigQuery table?

How do I assign surrogate keys when inserting records into a BigQuery table? Is there something like a sequence or NEXTVAL for generating unique values?

4 answers

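Here is an approach that generates a unique integer identifier for each row, with the identifiers ordered by some value in the original dataset, in this case the timestamp (RANK() gives tied rows the same value, so the identifiers are unique only if the ordering values are distinct; ROW_NUMBER() avoids this):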

SELECT RANK() OVER(ORDER BY timestamp) unique_id, title FROM [publicdata:samples.wikipedia] LIMIT 1000 

An alternative is to assign the identifiers in random order by ranking on a random number:

 SELECT RANK() OVER(ORDER BY random) unique_id, RAND() random, title FROM [publicdata:samples.wikipedia] LIMIT 1000 

To attach these values at insertion time, load the source data into a BigQuery table, then modify the query above to select from that table (instead of the Wikipedia sample) and save the results.
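
For reference, the same idea in standard SQL might look like the sketch below. It assumes a hypothetical destination dataset named mydataset and a hypothetical table name wikipedia_with_ids. It uses ROW_NUMBER() rather than RANK() so that tied timestamps still receive distinct IDs, and it limits the rows before numbering them so the sort stays small.

```sql
-- Sketch only: mydataset and wikipedia_with_ids are hypothetical names.
CREATE OR REPLACE TABLE mydataset.wikipedia_with_ids AS
SELECT
  -- ROW_NUMBER() never repeats a value, even when timestamps tie.
  ROW_NUMBER() OVER (ORDER BY `timestamp`) AS unique_id,
  title
FROM (
  -- Take the sample first, then number it.
  SELECT title, `timestamp`
  FROM `bigquery-public-data.samples.wikipedia`
  LIMIT 1000
);
```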


(Sorry, I don't have enough reputation to comment on existing answers ...)

What source and format are you loading the data from? If it is CSV or JSON from GCS, you can combine Michael's solution with our federated data sources ( https://cloud.google.com/bigquery/federated-data-sources ) to create the table and the IDs in one operation, rather than as a separate load and query.
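
A rough sketch of that approach using standard SQL DDL, which may differ from how the feature was exposed when this answer was written. The bucket path, dataset, table names, and the two-column CSV layout with a header row are all assumptions made for illustration.

```sql
-- Sketch only: mydataset, raw_events_ext, events_with_ids, the gs:// path,
-- and the column layout are hypothetical.
CREATE EXTERNAL TABLE mydataset.raw_events_ext (
  title STRING,
  `timestamp` INT64
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/events/*.csv'],
  skip_leading_rows = 1  -- skip the header row
);

-- Read directly from GCS and write the rows together with their IDs.
CREATE TABLE mydataset.events_with_ids AS
SELECT
  ROW_NUMBER() OVER (ORDER BY `timestamp`) AS unique_id,
  title
FROM mydataset.raw_events_ext;
```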


I do it like this:

SELECT
  (ROW_NUMBER() OVER ()) + (SELECT MAX(surrogate_key) FROM dimension_table) AS surrogate_key,
  business_key,
  attribute1,
  attributen,
  CURRENT_DATE AS start_date,
  NULL AS end_date,
  TRUE AS is_current
FROM source_table
  • NB: the last three columns are SCD2 (slowly changing dimension, type 2) fields, and standard (new-style) SQL syntax is required for this to work

  • NB2: if you add an ORDER BY inside ROW_NUMBER(), BigQuery will probably run into trouble, because ORDER BY cannot be parallelized
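
One edge case worth noting: if dimension_table is empty, MAX(surrogate_key) is NULL, and adding NULL to the row number yields NULL keys. A minimal variant that guards against this, assuming keys should start at 1 for an empty table, could look like:

```sql
-- Sketch only: same table and column names as the query above.
SELECT
  ROW_NUMBER() OVER ()
    -- Fall back to 0 when the dimension table has no rows yet.
    + COALESCE((SELECT MAX(surrogate_key) FROM dimension_table), 0) AS surrogate_key,
  business_key,
  attribute1
FROM source_table;
```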


If you want to generate surrogate key values in BigQuery, it is better to avoid ROW_NUMBER() OVER () and its variants. To quote a BigQuery post on surrogate keys:

To implement ROW_NUMBER(), BigQuery must sort the values in the root node of the execution tree, which is limited by the amount of memory in one execution node.

This leads to problems once the table holds more than a small number of records.

There are two alternatives:

Option 1: GENERATE_UUID()

Since the surrogate key has no business value and is simply a unique key created for use in the data warehouse, you can generate it by calling the GENERATE_UUID() function in BigQuery. This gives you a universally unique identifier (UUID) that you can use as a surrogate key value.

The downside is that this key is a 36-character string (32 hexadecimal digits plus four hyphens) rather than an 8-byte INT64. Therefore, if you have a huge number of records, it can noticeably increase the storage your data requires.
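
As a minimal sketch of option 1, reusing the hypothetical source_table and column names from the earlier answer:

```sql
-- Sketch only: source_table, business_key, and attribute1 are placeholder names.
SELECT
  GENERATE_UUID() AS surrogate_key,  -- STRING, evaluated separately for each row
  business_key,
  attribute1
FROM source_table;
```

No sort or window function is involved, so the work can be spread across workers.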

Option 2: create a unique hash

The second option is to use a hash function to generate a unique hash. This is a little trickier, as you need to find a combination of columns and/or some other random input that ensures you never generate the same value twice.

Some hash functions also output a 32-byte value, so you will not save on storage, but FARM_FINGERPRINT() outputs an INT64 value, which may save some storage. You can therefore combine options 1 and 2 to generate an integer surrogate key: FARM_FINGERPRINT(GENERATE_UUID())
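
The two hash variants might be sketched as follows. The table and column names are placeholders, and the '|' separator is an arbitrary choice. Keep in mind that a 64-bit fingerprint makes collisions very unlikely but not impossible, so a uniqueness check is still worthwhile on very large tables.

```sql
-- Sketch only: source_table, business_key, and start_date are placeholder names.
SELECT
  -- Random key: hash a fresh UUID down to an INT64.
  FARM_FINGERPRINT(GENERATE_UUID()) AS random_key,
  -- Deterministic key: hash a combination of columns.
  -- CONCAT returns NULL if any input is NULL, so the inputs must be non-null.
  FARM_FINGERPRINT(
    CONCAT(CAST(business_key AS STRING), '|', CAST(start_date AS STRING))
  ) AS derived_key,
  business_key
FROM source_table;
```

The deterministic form gives the same key every time the same row is loaded, while the UUID-based form produces a new key on every run.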


Source: https://habr.com/ru/post/1236182/

