Complex joins in Google Dataflow

I am new to Google Dataflow and am trying to figure out how we can rewrite a batch ETL process on it. I have read some of the documentation and run a few of the examples.

My proposal is to make the new ETL process event-driven: a stream of business events forms the initial PCollection, and each event triggers the ETL process for that particular business object. The ETL process would retrieve the relevant datasets from the source systems and pass those results (as PCollections) on to the next processing step. The processing steps will involve various types of joins, including Cartesian and non-key joins (for example, date-bound joins).

So a couple of questions:

(1) Is the approach I propose valid and efficient? If not, what would be better? I have not seen any presentations about complex, real-world ETL processes built on Google Dataflow, only simple scripts.

Are there any higher-level ETL products that are better suited? I have been following Spark and Flink for a while.

Our current ETL is moderately complex: there are only about 30 core target tables (classic EDW dimensions and facts), but around 1000 transformation steps. The source data is complex (approximately 150 Oracle tables).

(2) How would complex non-key joins be handled?

I am obviously attracted to Google Dataflow because it is API-first, and its parallel processing capabilities seem very good (we are being asked to move from nightly batch loads to incremental processing).

A good, proven Dataflow example for this use case would really help drive adoption!

Thanks Mike S.

1 answer

It sounds like Dataflow would be a good fit. We let you write a pipeline that takes a PCollection of business events and performs the ETL. The pipeline can run either in batch mode (executed periodically) or in streaming mode (executed whenever input data arrives).
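For illustration, here is a minimal pipeline skeleton using the Apache Beam Java SDK (the successor to the original Dataflow SDK). The class name, bucket, and topic are placeholders, not anything from the question:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class BusinessEventEtl {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Batch mode: read a periodic export of business events from files.
    PCollection<String> events =
        p.apply("ReadEvents", TextIO.read().from("gs://example-bucket/events/*.json"));

    // Streaming mode: the same pipeline shape, but reading events as they arrive.
    // PCollection<String> events =
    //     p.apply("ReadEvents",
    //         PubsubIO.readStrings().fromTopic("projects/example-project/topics/business-events"));

    // ... ETL transforms (lookups, joins, aggregations) attach to `events` here ...

    p.run();
  }
}
```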

The various joins are, for the most part, straightforward to express in Dataflow. For a Cartesian product, look at side inputs, which make the full contents of one PCollection available as input while processing each element of another PCollection.
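A rough sketch of this, again with the Beam Java SDK: `dimRows` and `factRows` are assumed to be existing `PCollection<String>`s (not defined in the question), and the side-input collection needs to fit in worker memory:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Materialize the (smaller) dimension PCollection as a side input.
PCollectionView<Iterable<String>> dimView = dimRows.apply(View.asIterable());

// Pair every fact row with every dimension row (Cartesian product).
PCollection<KV<String, String>> crossJoined = factRows.apply("CrossJoin",
    ParDo.of(new DoFn<String, KV<String, String>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        for (String dim : c.sideInput(dimView)) {
          c.output(KV.of(c.element(), dim));
        }
      }
    }).withSideInputs(dimView));
```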

You can also use GroupByKey or CoGroupByKey to implement joins. These group multiple inputs and give you access to all values with the same key in one place. You can also use Combine.perKey to compute associative and commutative combinations of all the elements associated with a key (e.g. SUM, MIN, MAX, AVERAGE, etc.).
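As a hedged sketch of both transforms (Beam Java SDK), where `ordersByCustomer`, `paymentsByCustomer`, and `amountsByCustomer` are assumed keyed `PCollection<KV<String, ...>>`s introduced only for illustration:

```java
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Tags identify each input of the join.
final TupleTag<String> ordersTag = new TupleTag<String>();
final TupleTag<String> paymentsTag = new TupleTag<String>();

// CoGroupByKey: for each customer key, all orders and all payments arrive together.
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(ordersTag, ordersByCustomer)
        .and(paymentsTag, paymentsByCustomer)
        .apply(CoGroupByKey.<String>create());
// Downstream, a DoFn can read both sides of the join for one key:
//   Iterable<String> orders   = c.element().getValue().getAll(ordersTag);
//   Iterable<String> payments = c.element().getValue().getAll(paymentsTag);

// Combine.perKey: an associative/commutative aggregation per key, e.g. a SUM.
PCollection<KV<String, Long>> totals =
    amountsByCustomer.apply(Combine.perKey(Sum.ofLongs()));
```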

Date-bound joins sound like a good fit for windowing, which allows you to write a pipeline that consumes windows of data (e.g. hourly windows, daily windows, 7-day windows that slide forward every day, etc.).
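For example (Beam Java SDK), assuming `keyedEvents` is a timestamped `PCollection<KV<String, String>>` and the window sizes are purely illustrative:

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Fixed daily windows.
PCollection<KV<String, String>> daily = keyedEvents.apply("DailyWindows",
    Window.<KV<String, String>>into(FixedWindows.of(Duration.standardDays(1))));

// Sliding 7-day windows that advance once a day.
PCollection<KV<String, String>> weekly = keyedEvents.apply("SlidingWeek",
    Window.<KV<String, String>>into(
        SlidingWindows.of(Duration.standardDays(7)).every(Duration.standardDays(1))));
```

A GroupByKey or CoGroupByKey applied after such a windowing step only groups elements that fall in the same window, which is one way to approximate a date-bound join.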


Edit: mentioned GroupByKey and CoGroupByKey.

