I have a pipeline that reads from Pub/Sub topics (windowed in minutes) and writes the processed results to BigQuery. I would like the output tables to be sharded over time as well as by a key from the data itself. BigQueryIO provides options for sharding tables by the window timestamp, but I don't think it provides any way to shard tables by a key taken from the input collection itself. If I have missed an alternative that does this, please let me know.
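For context, this is roughly the window-timestamp sharding I'm referring to; a minimal sketch only, assuming the Beam Java SDK, fixed windows, and placeholder project, dataset, table, and schema names:

```java
import java.util.Arrays;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.DateTimeZone;
import org.joda.time.format.DateTimeFormat;

public class WindowShardedWrite {

  // Placeholder schema for the sketch.
  private static final TableSchema SCHEMA = new TableSchema().setFields(Arrays.asList(
      new TableFieldSchema().setName("key").setType("STRING"),
      new TableFieldSchema().setName("value").setType("INTEGER")));

  static void writeShardedByWindow(PCollection<TableRow> rows) {
    rows.apply("WriteShardedByWindow",
        BigQueryIO.writeTableRows()
            .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
              @Override
              public TableDestination apply(ValueInSingleWindow<TableRow> value) {
                // Derive the table suffix from the window start (assumes fixed windows),
                // e.g. results_2018_01_31.
                String day = DateTimeFormat.forPattern("yyyy_MM_dd")
                    .withZone(DateTimeZone.UTC)
                    .print(((IntervalWindow) value.getWindow()).start());
                return new TableDestination(
                    "my-project:my_dataset.results_" + day, "Results for " + day);
              }
            })
            .withSchema(SCHEMA)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}
```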
To work around this, (option 1) I decided to split the original Pub/Sub topic by that same key, configure the pipeline to read from the several resulting topics, process each on its own branch, and write each branch's result to BigQuery sharded by window timestamp, and that seems to work. What I would like to know is: given that the intermediate processing step in Dataflow can be agnostic of the source or sink, would it be more efficient (in terms of resources and time) to keep a single Pub/Sub topic and instead add an extra transform before the BigQuery write step to split the collection by key and then write each split to BigQuery (option 2)?
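Option 1 currently looks roughly like this (again just a sketch, with placeholder topic names and the per-branch processing elided; I'm assuming the Beam Java SDK and one split topic per key):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class OptionOnePipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    List<String> keys = Arrays.asList("keyA", "keyB", "keyC");  // placeholder keys

    // One read + process + write branch per split topic.
    for (String key : keys) {
      PCollection<String> windowed = p
          .apply("Read_" + key,
              PubsubIO.readStrings().fromTopic("projects/my-project/topics/events-" + key))
          .apply("Window_" + key,
              Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));
      // ... per-branch parsing/processing goes here, followed by a BigQueryIO write sharded
      // by window timestamp (as in the previous snippet), targeting tables like
      // results_keyA_<window date>.
    }
    p.run();
  }
}
```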
Option 1:
+ Puts less load on Pub/Sub during reads/writes, since even a combined message can run to several hundred KB
- The read step and the intermediate processing are duplicated as separate branches (which may be inefficient in Dataflow)

Option 2:
+ Cleaner pipeline
- An additional Partition step, which also reads the same collection as many times as there are partitions; but the number of elements in the collection, and the number of partitions, are both quite small, so this should not be a serious problem
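For comparison, option 2 would look roughly like this: a single read, then a Partition transform splitting the processed collection by a key field before the per-branch BigQuery writes. Again only a sketch, assuming the Java SDK, a known list of keys, and placeholder field and table names:

```java
import java.util.Arrays;
import java.util.List;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class OptionTwoWrite {
  static void writePerKey(PCollection<TableRow> processed) {
    final List<String> keys = Arrays.asList("keyA", "keyB", "keyC");  // placeholder keys

    // Split the single processed collection into one branch per key.
    PCollectionList<TableRow> branches = processed.apply("SplitByKey",
        Partition.of(keys.size(), new Partition.PartitionFn<TableRow>() {
          @Override
          public int partitionFor(TableRow row, int numPartitions) {
            int idx = keys.indexOf(row.get("key"));
            return idx >= 0 ? idx : 0;  // unknown keys fall back to the first branch
          }
        }));

    for (int i = 0; i < keys.size(); i++) {
      String key = keys.get(i);
      branches.get(i).apply("Write_" + key,
          BigQueryIO.writeTableRows()
              // In practice this would be the same window-sharded destination function as in
              // the first snippet, with the key baked into the table name.
              .to("my-project:my_dataset.results_" + key)
              .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
              .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
    }
  }
}
```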
I think option 2 makes more sense based on my reading of the Pipeline Design Principles, but I would still like to confirm that I'm approaching this the right way.