Pros / cons of streaming into BigQuery directly vs. through Google Pub/Sub + Dataflow

We have a Node.js API hosted on Google Kubernetes Engine, and we would like to start recording events in BigQuery.

I see three different ways to do this:

  • Option 1: insert each event directly into BigQuery from the API, using the Node BigQuery client (streaming inserts).
  • Option 2: publish each event to a Pub/Sub topic, then use a Cloud Dataflow pipeline to stream it into BigQuery.
  • Option 3: publish each event to a Pub/Sub topic, and write a custom worker that subscribes to the topic and streams the events into BQ.

In this specific use case we do not need any transformations and can simply send events directly in the desired format. But later there may be other use cases where we need to sync tables from our main datastore (MySQL) into BQ for analytics, so maybe it is worth starting with Dataflow right away?

A few questions:

  • Option 1 (sending each event directly to BQ) seems simple if you don't have any transformations. Is it as fast and reliable as publishing to a Pub/Sub topic? I am mostly concerned about latency and error / duplication handling ( https://cloud.google.com/bigquery/troubleshooting-errors#streaming ). Maybe this is better done in a separate process? (A minimal sketch of a direct insert appears after this list.)
  • For option 2, are there any Dataflow "presets" that don't require writing custom code when all you need is to read from Pub/Sub and reliably write to BQ without any transformations (maybe with deduplication / error handling only)?
  • Are there any pitfalls with a simple custom worker (option 3) that reads from Pub/Sub, streams into BQ, and handles all errors / retries itself?
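
For reference, here is a minimal sketch of option 1 using the @google-cloud/bigquery Node client, assuming an existing analytics.events table whose schema matches the event payload (the dataset, table, and field names are illustrative, not from the question):

    // Option 1 sketch: stream a single event straight into BigQuery from the API process.
    import { BigQuery } from '@google-cloud/bigquery';

    const bigquery = new BigQuery();

    export async function recordEvent(event: { type: string; userId: string; ts: string }) {
      await bigquery
        .dataset('analytics')          // hypothetical dataset name
        .table('events')               // hypothetical table name
        .insert(
          [{ insertId: `${event.userId}-${event.ts}`, json: event }],
          { raw: true }                // raw rows let us set insertId for best-effort dedup
        );
    }

Note that insertId only gives best-effort deduplication on the BigQuery side; retrying quota or transient errors is still up to the caller, which is exactly the concern in the linked troubleshooting page.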
3 answers

For option 2, yes: there is a Google-provided Dataflow template that makes it easy to move data from Pub/Sub to BigQuery without having to write any code.

You can learn more about how to use this Google-provided template, and others, in the Cloud Dataflow documentation.
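
For illustration (not part of the original answer), launching the classic Pub/Sub-to-BigQuery template with gcloud looks roughly like this; the job name, region, topic, and output table are placeholders:

    gcloud dataflow jobs run events-to-bq \
      --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
      --region us-central1 \
      --parameters inputTopic=projects/MY_PROJECT/topics/events,outputTableSpec=MY_PROJECT:analytics.events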


Another option is to export the logs using a log sink. Directly from the Stackdriver Logging UI you can specify BigQuery (or other destinations) for your logs. Since your Node API runs in Kubernetes, you just need to write messages to stdout and they will automatically be forwarded to Stackdriver.

Link: https://cloud.google.com/logging/docs/export/configure_export_v2
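
As a hedged sketch of what that looks like from the Node side (field names are illustrative): the GKE logging agent forwards each JSON line written to stdout as a structured log entry, which a sink can then route to BigQuery.

    // Write one JSON object per line to stdout; the GKE logging agent ships it to Stackdriver.
    function logEvent(event: Record<string, unknown>): void {
      console.log(JSON.stringify({ severity: 'INFO', eventType: 'api_event', ...event }));
    }

    logEvent({ userId: '123', action: 'signup', ts: new Date().toISOString() });

Keep in mind that the BigQuery tables created by a log sink follow the log entry schema rather than one you design, which may or may not suit your analytics needs.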


I have looked into this, and it seems to me that the existing answers are a little lacking. Here is what I can tell you about the pros and cons of each approach:

  1. Writing a custom program (via the Node BQ API or a custom worker) has several pitfalls when it comes to exactly-once guarantees. In particular, if you write your own worker, you will need to do extra work to checkpoint progress and to make sure that no elements are dropped or duplicated in case of runtime errors or your worker process dying. (See the sketch after this list.)

  2. If your requirements change (for example, performing BQ streaming inserts becomes too expensive), Dataflow's Java SDK seamlessly supports either option: streaming inserts, or cheaper multiple load jobs into BQ instead of streaming inserts; and it also handles multiple data sources nicely.

  3. Dataflow provides automatic scaling in case your data volume increases.
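
To make pitfall 1 concrete, here is a hedged sketch of such a custom worker (the subscription, dataset, and table names are assumptions, not from the question). It acks a Pub/Sub message only after the corresponding row has been inserted and reuses the message ID as insertId, but that still only gives at-least-once delivery with best-effort deduplication:

    import { PubSub, Message } from '@google-cloud/pubsub';
    import { BigQuery } from '@google-cloud/bigquery';

    const pubsub = new PubSub();
    const table = new BigQuery().dataset('analytics').table('events');

    const subscription = pubsub.subscription('events-sub');

    subscription.on('message', async (message: Message) => {
      try {
        const row = JSON.parse(message.data.toString());
        // Reuse the Pub/Sub message ID as insertId so redeliveries are usually deduplicated.
        await table.insert([{ insertId: message.id, json: row }], { raw: true });
        message.ack();   // only ack once the row is in BigQuery
      } catch (err) {
        console.error('insert failed; message will be redelivered', err);
        message.nack();  // let Pub/Sub redeliver the message later
      }
    });

    subscription.on('error', (err) => console.error('subscription error', err));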

With that in mind, I would say:

  • If your use case is relatively simple and you are okay with very infrequent data points being dropped when workers restart, then a custom-written Node / Python worker should do the trick.

  • If your use case involves only streaming from Pub/Sub to BQ, but you need to make sure no data is dropped, check out the template provided by Andrew, which does exactly that.

  • If your use case is likely to be more complex than that, you can write your own pipeline (and use the template code as inspiration!).

