How to reduce initialization and shutdown time in Google Dataflow?

I am currently working on a POC focused mainly on Dataflow for ETL processing. I built a pipeline using the Dataflow 2.1 Java Beam API, and every run takes 3-4 minutes to initialize and another 1-2 minutes to shut down, while the actual transformation (ParDo) takes less than a minute. I have tried running the job in several ways:

  • Running the job from a local machine
  • Running the job remotely on GCP
  • Running the job from a Dataflow template

In all three cases, initialization and shutdown consume roughly the same amount of time. This is a bottleneck for the POC, because we intend to launch hundreds of jobs every day.
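To illustrate why this overhead dominates at that scale, here is a rough back-of-envelope sketch. The timings are the worst-case figures from above (4 min init, 2 min shutdown), and the job counts are purely illustrative assumptions:

```java
// Hypothetical illustration: fixed per-job overhead multiplied across many jobs.
public class DataflowOverhead {

    // Total daily overhead in minutes, assuming each job pays a fixed
    // initialization and shutdown cost (both in minutes).
    static int dailyOverheadMinutes(int jobsPerDay, int initMinutes, int shutdownMinutes) {
        return jobsPerDay * (initMinutes + shutdownMinutes);
    }

    public static void main(String[] args) {
        // 200 separate jobs per day, each paying ~6 minutes of overhead:
        int separate = dailyOverheadMinutes(200, 4, 2);
        // Folding the same work into 10 long-running pipelines pays the
        // cost only 10 times:
        int batched = dailyOverheadMinutes(10, 4, 2);
        System.out.println("separate jobs: " + separate + " min of overhead/day");
        System.out.println("batched:       " + batched + " min of overhead/day");
    }
}
```

With 200 one-minute jobs, overhead (1200 min) would dwarf actual processing time by more than an order of magnitude, which is why amortizing startup across jobs matters more here than speeding up the ParDo itself.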

I am looking for a way to amortize the initialization/shutdown time across all jobs, so that it becomes a one-time cost, or for any other approach to reducing these times.

Thanks in advance!


Source: https://habr.com/ru/post/1687377/
