I am new to Google Dataflow and trying to figure out how we could rewrite a batch ETL process on it. I have read some of the documentation and run a few of the examples.
My proposal is that the new ETL process be driven by business events (the set of which forms the initial PCollection). A business event would trigger the ETL process for that particular business object. The ETL process would extract the relevant datasets from the source systems and then pass the results (PCollections) on to the next processing stage. The processing stages would involve various kinds of joins, including Cartesian and non-key joins (for example, date-bound joins).
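To make the idea concrete, here is a minimal sketch of the shape I have in mind, assuming the Apache Beam Python SDK (which Dataflow runs); the event fields and the extract_facts/extract_dims helpers are made up for illustration and would really be lookups against our Oracle source tables:

```python
import apache_beam as beam

# Hypothetical extraction helpers; in reality these would fetch the
# source-system rows for the business object referenced by the event.
def extract_facts(event):
    # Emit (key, fact_record) pairs for this business object.
    return [(event['customer_id'], {'amount': event['amount']})]

def extract_dims(event):
    # Emit (key, dimension_record) pairs for the same key.
    return [(event['customer_id'], {'region': event['region']})]

with beam.Pipeline() as p:
    # The initial PCollection of business events (stubbed with in-memory data).
    events = p | 'ReadEvents' >> beam.Create([
        {'customer_id': 'C1', 'amount': 100.0, 'region': 'EU'},
    ])

    facts = events | 'ToFacts' >> beam.FlatMap(extract_facts)
    dims = events | 'ToDims' >> beam.FlatMap(extract_dims)

    # A keyed join between the two intermediate PCollections.
    joined = (
        {'facts': facts, 'dims': dims}
        | 'JoinOnKey' >> beam.CoGroupByKey()
        | 'Print' >> beam.Map(print)
    )
```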
So a couple of questions:
(1) Is the approach I propose valid and efficient? If not, what would be better? I have not seen any presentations about complex, real-world ETL processes built on Google Dataflow, only simple examples.
Are there any higher-level ETL products that would be a better fit? I have been following Spark and Flink for a while.
Our current ETL is moderately complex: there are only about 30 target tables (classic EDW dimensions and facts), but roughly 1000 transformation steps. The source data is complex (approximately 150 Oracle tables).
(2) How would complex non-key joins be handled?
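The only way I have found so far for a date-bound (non-equi) join is to broadcast the smaller side as a side input and do the range matching inside a FlatMap/ParDo. A minimal sketch, again assuming the Beam Python SDK, with made-up currency-rate data:

```python
import apache_beam as beam

def rate_for_date(txn, rates):
    # Date-bound join: pick the rate whose validity window contains the
    # transaction date. 'rates' is the entire (small) rate table, broadcast
    # as a side input, so this is effectively a non-key join done in a DoFn.
    for valid_from, valid_to, rate in rates:
        if valid_from <= txn['date'] <= valid_to:
            yield {**txn, 'rate': rate}

with beam.Pipeline() as p:
    txns = p | 'Txns' >> beam.Create([
        {'id': 'T1', 'date': '2016-03-15', 'amount': 42.0},
    ])
    rates = p | 'Rates' >> beam.Create([
        ('2016-03-01', '2016-03-31', 1.12),
        ('2016-04-01', '2016-04-30', 1.10),
    ])

    joined = txns | 'DateBoundJoin' >> beam.FlatMap(
        rate_for_date, rates=beam.pvalue.AsList(rates))
    joined | 'Print' >> beam.Map(print)
```

Is this the intended pattern, or is there a better way when both sides are too large to broadcast?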
I am obviously attracted to Google Dataflow because it is first and foremost an API, and its parallel processing capabilities look very good (we are being asked to move from nightly batch to incremental processing).
A good, proven Dataflow example for this use case would really help drive adoption!
Thanks Mike S.