Just trying to work out the programming model here. Scenario: I'm using Pub/Sub + Dataflow for analytics on a web forum. I have a stream of data coming from Pub/Sub that looks like this:
ID | TS | EventType
1  | 1  | Create
1  | 2  | Comment
2  | 2  | Create
1  | 4  | Comment
And I want to end up with a stream coming out of Dataflow that looks like this:
ID | TS | num_comments
1  | 1  | 0
1  | 2  | 1
2  | 2  | 0
1  | 4  | 2
I want the job that performs this aggregation to run as a streaming job, with new values being populated as new events arrive. My question is: where is the idiomatic place for the job to store the state for the current topic IDs and comment counts? Bear in mind that topics can live for years. Current ideas:
- Write a "current" entry for each topic ID to Bigtable, and in a DoFn query the current comment count for the topic ID as events come in. Even as I write this, I'm not a fan (see the Bigtable sketch after this list).
- Use side inputs somehow? It seems like maybe this is the answer, but if so, I don't fully understand how.
- Set up a streaming job with a global window and a trigger that fires every time it receives a record, and rely on Dataflow to keep the entire pane history somewhere (unbounded storage requirement?) (see the windowing sketch after this list).
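To make the first idea concrete, here's a rough sketch of what I mean, using the Cloud Bigtable Java client inside a DoFn. The table, column family, qualifier, project/instance names, and the ForumEvent element type are all made up for illustration:

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Row;
import com.google.cloud.bigtable.data.v2.models.RowMutation;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import java.io.Serializable;

// Hypothetical parsed event type (ID + EventType from the stream above); it would need a coder.
class ForumEvent implements Serializable {
  long topicId;
  String eventType; // "Create" or "Comment"
}

// Idea 1: keep the running comment count for each topic in Bigtable and
// read-modify-write it for every element.
class BigtableCommentCountFn extends DoFn<ForumEvent, KV<Long, Long>> {
  private static final String TABLE = "topic_counts";     // assumed table
  private static final String FAMILY = "counts";          // assumed column family
  private static final String QUALIFIER = "num_comments"; // assumed qualifier

  private transient BigtableDataClient client;

  @Setup
  public void setup() throws Exception {
    client = BigtableDataClient.create("my-project", "my-instance"); // assumed project/instance
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    ForumEvent event = c.element();
    String rowKey = Long.toString(event.topicId);

    // Read the current count; a missing row means no comments yet.
    Row row = client.readRow(TABLE, rowKey);
    long current = 0L;
    if (row != null && !row.getCells(FAMILY, QUALIFIER).isEmpty()) {
      current = Long.parseLong(row.getCells(FAMILY, QUALIFIER).get(0).getValue().toStringUtf8());
    }

    long updated = "Comment".equals(event.eventType) ? current + 1 : current;

    // Write the new count back. This read-modify-write is not atomic and interacts
    // badly with retries/redelivery, which is part of why I'm not a fan of this idea.
    client.mutateRow(RowMutation.create(TABLE, rowKey)
        .setCell(FAMILY, QUALIFIER, Long.toString(updated)));

    c.output(KV.of(event.topicId, updated));
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
    }
  }
}
```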
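And here's roughly what I have in mind for the third idea, using the Beam/Dataflow Java SDK. The topic name and the line parsing are placeholders, and the timestamp column is ignored for simplicity:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class TopicCommentCounts {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    PCollection<String> events = p.apply("ReadEvents",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/forum-events"));

    PCollection<KV<Long, Long>> runningCounts = events
        // Parse "ID | TS | EventType" lines into (topicId, increment); the TS column is unused here.
        .apply("ToTopicIncrement", MapElements.via(new SimpleFunction<String, KV<Long, Long>>() {
          @Override
          public KV<Long, Long> apply(String line) {
            String[] parts = line.split("\\|");
            long topicId = Long.parseLong(parts[0].trim());
            long increment = "Comment".equals(parts[2].trim()) ? 1L : 0L;
            return KV.of(topicId, increment);
          }
        }))
        // Global window, fire on every element, and accumulate fired panes so each firing
        // emits the running total for that topic rather than a per-pane delta.
        .apply("PerElementTrigger", Window.<KV<Long, Long>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
            .accumulatingFiredPanes()
            .withAllowedLateness(Duration.ZERO))
        .apply("RunningCommentCount", Sum.longsPerKey());

    // runningCounts emits an updated (topicId, num_comments) pair on every event;
    // write it back out (e.g. to Pub/Sub or Bigtable) for downstream consumers.

    p.run();
  }
}
```

As I understand it, because Sum is a combiner, the runner only has to persist one accumulator (a single long) per topic key rather than every element, so the storage worry is per-key rather than per-event, but I'd like confirmation that this is the intended pattern.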
EDIT: Just to clarify, I wouldn't have trouble implementing any of these three strategies, or a million other ways of doing it; I'm more interested in what the best way of doing this is with Dataflow. What will be most resilient to failures, to having to re-process history for backfills, etc., etc.
EDIT 2: There is currently a bug in the Dataflow service where updates fail if you add inputs to a Flatten transform, which means you'll need to discard and rebuild any state accrued in the job if you make a change to the job that involves adding something to a Flatten operation.