Pros / cons of streaming into BigQuery directly vs. through Google Pub/Sub + Dataflow

We have a Node.js API hosted on Google Kubernetes Engine, and we would like to start recording events in BigQuery.

I see three different ways to do this:

  • Option 1: insert each event directly into BigQuery from the API, using the Node BigQuery client (streaming inserts).
  • Option 2: publish each event to a Pub/Sub topic, then use a Cloud Dataflow pipeline to stream it into BigQuery.
  • Option 3: publish each event to a Pub/Sub topic, and write a custom worker that subscribes to the topic and streams the events into BQ.

In this specific use case we do not need any transformations and can simply send events directly in the desired format. But later there may be other use cases where we need to sync tables from our main datastore (MySQL) into BQ for analytics, so maybe it is worth starting with Dataflow right away?

A few questions:

  • Option 1 (sending each event directly to BQ) seems simple if you don't have any transformations. Is it as fast and reliable as publishing to a Pub/Sub topic? I am mostly concerned about latency and error / duplication handling ( https://cloud.google.com/bigquery/troubleshooting-errors#streaming ). Maybe this is better done in a separate process? (A minimal sketch of a direct insert appears after this list.)
  • For option 2, are there any Dataflow "presets" that don't require writing custom code when all you need is to read from Pub/Sub and reliably write to BQ without any transformations (maybe with deduplication / error handling only)?
  • Are there any pitfalls with a simple custom worker (option 3) that reads from Pub/Sub, streams into BQ, and handles all errors / retries itself?
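
For reference, here is a minimal sketch of option 1 using the @google-cloud/bigquery Node client, assuming an existing analytics.events table whose schema matches the event payload (the dataset, table, and field names are illustrative, not from the question):

    // Option 1 sketch: stream a single event straight into BigQuery from the API process.
    import { BigQuery } from '@google-cloud/bigquery';

    const bigquery = new BigQuery();

    export async function recordEvent(event: { type: string; userId: string; ts: string }) {
      await bigquery
        .dataset('analytics')          // hypothetical dataset name
        .table('events')               // hypothetical table name
        .insert(
          [{ insertId: `${event.userId}-${event.ts}`, json: event }],
          { raw: true }                // raw rows let us set insertId for best-effort dedup
        );
    }

Note that insertId only gives best-effort deduplication on the BigQuery side; retrying quota or transient errors is still up to the caller, which is exactly the concern in the linked troubleshooting page.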
3 answers

For option 2, yes: there is a Google-provided Dataflow template that makes it easy to move data from Pub/Sub to BigQuery without having to write any code.

You can learn more about how to use this Google-provided template, and others, in the Cloud Dataflow documentation.
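
For illustration (not part of the original answer), launching the classic Pub/Sub-to-BigQuery template with gcloud looks roughly like this; the job name, region, topic, and output table are placeholders:

    gcloud dataflow jobs run events-to-bq \
      --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
      --region us-central1 \
      --parameters inputTopic=projects/MY_PROJECT/topics/events,outputTableSpec=MY_PROJECT:analytics.events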


Another option is to export the logs using a log sink. Directly from the Stackdriver Logging UI you can specify BigQuery (or other destinations) for your logs. Since your Node API runs in Kubernetes, you just need to write messages to stdout and they will automatically be forwarded to Stackdriver.

Link: https://cloud.google.com/logging/docs/export/configure_export_v2
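
As a hedged sketch of what that looks like from the Node side (field names are illustrative): the GKE logging agent forwards each JSON line written to stdout as a structured log entry, which a sink can then route to BigQuery.

    // Write one JSON object per line to stdout; the GKE logging agent ships it to Stackdriver.
    function logEvent(event: Record<string, unknown>): void {
      console.log(JSON.stringify({ severity: 'INFO', eventType: 'api_event', ...event }));
    }

    logEvent({ userId: '123', action: 'signup', ts: new Date().toISOString() });

Keep in mind that the BigQuery tables created by a log sink follow the log entry schema rather than one you design, which may or may not suit your analytics needs.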


I have looked into this, and it seems to me that the existing answers are a little lacking. Here is what I can tell you about the pros and cons of each approach:

  1. Writing a custom program (via the Node BQ API or a custom worker) has several pitfalls when it comes to exactly-once guarantees. In particular, if you write your own worker, you will need to do extra work to checkpoint progress and to make sure that no elements are dropped or duplicated in case of runtime errors or your worker process dying. (See the sketch after this list.)

  2. If your requirements change (for example, performing BQ streaming inserts becomes too expensive), Dataflow's Java SDK seamlessly supports either option: streaming inserts, or cheaper multiple load jobs into BQ instead of streaming inserts; and it also handles multiple data sources nicely.

  3. Dataflow provides automatic scaling in case your data volume increases.
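
To make pitfall 1 concrete, here is a hedged sketch of such a custom worker (the subscription, dataset, and table names are assumptions, not from the question). It acks a Pub/Sub message only after the corresponding row has been inserted and reuses the message ID as insertId, but that still only gives at-least-once delivery with best-effort deduplication:

    import { PubSub, Message } from '@google-cloud/pubsub';
    import { BigQuery } from '@google-cloud/bigquery';

    const pubsub = new PubSub();
    const table = new BigQuery().dataset('analytics').table('events');

    const subscription = pubsub.subscription('events-sub');

    subscription.on('message', async (message: Message) => {
      try {
        const row = JSON.parse(message.data.toString());
        // Reuse the Pub/Sub message ID as insertId so redeliveries are usually deduplicated.
        await table.insert([{ insertId: message.id, json: row }], { raw: true });
        message.ack();   // only ack once the row is in BigQuery
      } catch (err) {
        console.error('insert failed; message will be redelivered', err);
        message.nack();  // let Pub/Sub redeliver the message later
      }
    });

    subscription.on('error', (err) => console.error('subscription error', err));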

With that in mind, I would say:

  • If your use case is relatively simple and you are okay with very infrequent data points being dropped when workers restart, then a custom-written Node / Python worker should do the trick.

  • If your use case involves only streaming from Pub/Sub to BQ, but you need to make sure no data is dropped, check out the template provided by Andrew, which does exactly that.

  • If your use case is likely to be more complex than that, you can write your own pipeline (and use the template code as inspiration!).

