I am currently working on a POC, focused mainly on Dataflow for ETL processing. I created a pipeline using the Dataflow 2.1 Java Beam API. Each run takes 3-4 minutes to initialize and another 1-2 minutes to complete, while the actual transformation (ParDo) takes less than a minute. I have tried several approaches:
- Running the job on the local machine
- Running the job remotely on GCP
- Running the job from a Dataflow template
However, all of the above methods consume roughly the same amount of time for initialization and completion. This is a bottleneck for the POC, as we intend to launch hundreds of jobs every day.
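For reference, the template approach above was launched roughly like this (bucket, template, and parameter names here are placeholders, not the actual ones):

```shell
# Launch a Dataflow job from a pre-staged template.
# gs:// paths and parameter names below are illustrative placeholders.
gcloud dataflow jobs run my-etl-job \
  --gcs-location gs://my-bucket/templates/my-etl-template \
  --region us-central1 \
  --parameters input=gs://my-bucket/input/,output=gs://my-bucket/output/
```

Even with the template pre-staged, the per-job worker startup and teardown time is unchanged.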
I am looking for a way to amortize the initialization/completion time across all jobs, so that it becomes a one-time activity, or for any other approach to reduce this overhead.
Thanks in advance!