BigQuery: how to flatten a repeated structured property imported from Datastore

Question


Dear all,

I started using BigQuery this month to analyze data in the GAE Datastore. First, I export the data to Google Cloud Storage through GAE's Datastore Admin page, and then I import it from Google Cloud Storage into BigQuery. This works very smoothly, except for a repeated structured property. I expected the imported record to be in this format:

  parent: "James",
  children: [
    { name: "name1", age: 5, gender: "M" },
    { name: "name2", age: 50, gender: "F" },
    { name: "name3", age: 33, gender: "M" }
  ]

I know how to flatten data in the above format. But the actual data format in BigQuery looks like this:

  parent: "James",
  children.name: ["name1", "name2", "name3"],
  children.age: [5, 50, 33],
  children.gender: ["M", "F", "M"]

I am wondering whether it is possible to flatten the above data in BigQuery for further analysis. The ideal result table format I have in mind:

  parentName, children.name, children.age, children.gender
  James, name1, 5, "M"
  James, name2, 50, "F"
  James, name3, 33, "M"

Cheers!


google-app-engine google-cloud-datastore google-bigquery

James gan Jun 21 '13 at 5:29


2 answers

You can use the "large query results" feature to create a new flattened table. Unfortunately, the syntax is terrible. The basic principle is to flatten each field while keeping its position, and then filter for rows where the positions are the same. Try something like:

 SELECT parentName, children.name, children.age, children.gender,
   position(children.name) as name_pos,
   position(children.age) as age_pos,
   position(children.gender) as gender_pos,
 FROM table

 SELECT parent, children.name, children.age, children.gender, pos
 FROM (
   SELECT parent, children.name, children.age, children.gender, gender_pos, pos
   FROM (
     FLATTEN((
       SELECT parent, children.name, children.age, children.gender, pos,
         POSITION(children.gender) as gender_pos
       FROM (
         SELECT parent, children.name, children.age, children.gender, pos,
         FROM (
           FLATTEN((
             SELECT parent, children.name, children.age, children.gender, pos,
               POSITION(children.age) AS age_pos
             FROM (
               FLATTEN((
                 SELECT parent, children.name, children.age, children.gender,
                   POSITION(children.name) AS pos
                 FROM table
               ), children.name))),
           children.age))
         WHERE age_pos = pos)),
     children.gender)))
 WHERE gender_pos = pos;
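The principle behind the query above can also be modelled outside BigQuery. The following is only a minimal Python sketch of the idea, not BigQuery code: it expands every repeated field together with its position and keeps only the combinations where all three positions agree. The function name `flatten_by_position` and the dict layout are assumptions made for illustration.

```python
from itertools import product


def flatten_by_position(record):
    """Expand each parallel array with its position, keep rows whose positions match."""
    # Positions are numbered from 1 here; the numbering base does not change the result.
    names = list(enumerate(record["children.name"], start=1))
    ages = list(enumerate(record["children.age"], start=1))
    genders = list(enumerate(record["children.gender"], start=1))

    rows = []
    # Every (name, age, gender) combination is generated first ...
    for (pos, name), (age_pos, age), (gender_pos, gender) in product(names, ages, genders):
        # ... and only those with equal positions survive the filter.
        if age_pos == pos and gender_pos == pos:
            rows.append((record["parent"], name, age, gender))
    return rows


# Hypothetical record shaped like the imported data in the question.
record = {
    "parent": "James",
    "children.name": ["name1", "name2", "name3"],
    "children.age": [5, 50, 33],
    "children.gender": ["M", "F", "M"],
}
rows = flatten_by_position(record)
```

Each tuple in `rows` corresponds to one line of the result table the question asks for.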

To allow large results, if you use the BigQuery user interface, you must click the "advanced options" button, specify the destination table and check the "allow large results" flag.

Please note that if your data is stored as an entity with a nested record that looks like {name, age, gender}, we should be converting it to a nested record in BigQuery instead of parallel arrays. I will look into why this is happening.
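To make the difference between the two shapes concrete, here is a small Python sketch, offered only as an illustration, that converts parallel arrays into one record per child and back. The helper names `to_nested` and `to_parallel` are hypothetical and not part of any BigQuery or Datastore API.

```python
def to_nested(parallel):
    """Zip parallel arrays into one record per child.

    zip() stops at the shortest array, so arrays of unequal length lose items.
    """
    keys = list(parallel)
    return [dict(zip(keys, values)) for values in zip(*parallel.values())]


def to_parallel(nested):
    """Inverse of to_nested: split child records back into parallel arrays."""
    keys = list(nested[0]) if nested else []
    return {key: [child[key] for child in nested] for key in keys}


# Parallel arrays, as the data currently arrives in BigQuery.
parallel = {
    "name": ["name1", "name2", "name3"],
    "age": [5, 50, 33],
    "gender": ["M", "F", "M"],
}
nested = to_nested(parallel)
```

Both shapes carry the same information as long as the arrays have equal length; the nested form simply keeps each child's fields together.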


Jordan Tigani Jun 21 '13 at 15:12


Mikhail Berlyant · Accepted Answer · 2016-05-19T00:37:25+0000

With the recently introduced BigQuery Standard SQL, things are much nicer!
Try the query below (make sure to uncheck the Use Legacy SQL checkbox under Show Options):

 WITH parents AS (
   SELECT
     "James" AS parentName,
     STRUCT(
       ["name1", "name2", "name3"] AS name,
       [5, 50, 33] AS age,
       ["M", "F", "M"] AS gender
     ) AS children
 )
 SELECT parentName, childrenName, childrenAge, childrenGender
 FROM parents,
   UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name,
   UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age,
   UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
 WHERE pos_name = pos_age AND pos_name = pos_gender

Here the source table, parents, holds data with a structure like:

 [{
   "parentName": "James",
   "children": {
     "name": ["name1", "name2", "name3"],
     "age": ["5", "50", "33"],
     "gender": ["M", "F", "M"]
   }
 }]

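and the output is one row per child:

 parentName, childrenName, childrenAge, childrenGender
 James, name1, 5, M
 James, name2, 50, F
 James, name3, 33, M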

Note: the above is based solely on what I see in the original question and most likely needs to be adjusted to take into account any specific needs that you have. Hope this helps in terms of direction to go and where to start!

Added:

The query above uses row-wise CROSS JOINs, which means that all combinations for the same parent are produced first, and then the WHERE clause filters out the "wrong" ones.

In contrast, the version below uses INNER JOIN to eliminate this "side effect":

 WITH parents AS (
   SELECT
     "James" AS parentName,
     STRUCT(
       ["name1", "name2", "name3"] AS name,
       [5, 50, 33] AS age,
       ["M", "F", "M"] AS gender
     ) AS children
 )
 SELECT parentName, childrenName, childrenAge, childrenGender
 FROM parents,
   UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name
 JOIN UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age
   ON pos_name = pos_age
 JOIN UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
   ON pos_age = pos_gender

Intuitively, I would expect the second version to be slightly more efficient for large tables.
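The intuition can be illustrated with a rough Python model that counts how many candidate rows each strategy touches. This is only a sketch under the assumption that the work is proportional to the rows examined; it says nothing about how BigQuery actually plans or executes either query. The function names are hypothetical.

```python
from itertools import product


def cross_join_then_filter(names, ages, genders):
    """Build every combination, then keep those with equal offsets."""
    candidates = list(product(enumerate(names), enumerate(ages), enumerate(genders)))
    rows = [
        (name, age, gender)
        for (pos_name, name), (pos_age, age), (pos_gender, gender) in candidates
        if pos_name == pos_age and pos_name == pos_gender
    ]
    return rows, len(candidates)


def join_on_offset(names, ages, genders):
    """Match elements directly by offset, one lookup per name."""
    ages_by_pos = dict(enumerate(ages))
    genders_by_pos = dict(enumerate(genders))
    rows = [
        (name, ages_by_pos[pos], genders_by_pos[pos])
        for pos, name in enumerate(names)
        if pos in ages_by_pos and pos in genders_by_pos
    ]
    return rows, len(names)


names = ["name1", "name2", "name3"]
ages = [5, 50, 33]
genders = ["M", "F", "M"]

cross_rows, cross_examined = cross_join_then_filter(names, ages, genders)
join_rows, join_examined = join_on_offset(names, ages, genders)
```

Both functions return the same rows; in this model the first examines the product of the three array lengths, while the second examines one candidate per name.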
