What is the best way to achieve idempotence when using Cloud DataFlow and PubSub?
We currently have a system that processes and stores records in a MySQL database. I am interested in using DataFlow for some of our reports, but I would like to understand what I need to do so that I do not accidentally count the same message twice (or more).
My confusion comes in two parts: first, making sure I send each message only once and, second, making sure I process each message only once.
My gut feeling is that it would look like this:
Whenever an event of interest to me is recorded in our MySQL database, convert it to a PubSub message and publish it to PubSub. Assuming success, store the PubSub message ID that was returned alongside the MySQL record. That way, if a record has a PubSub ID, I know that I have sent it and do not need to send it again. If publishing to PubSub failed, I know that I need to send it again. So far, so good.
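To make the publishing side concrete, here is a rough sketch of what I have in mind. It is only an illustration under assumptions: `sqlite3` stands in for our MySQL database, the `events` table and its columns are made up, and `fake_publish` is a hypothetical placeholder for the real PubSub publish call (it just returns a random ID).

```python
import sqlite3
import uuid


def fake_publish(payload):
    """Hypothetical stand-in for a PubSub publish call; returns a message ID."""
    return uuid.uuid4().hex


def publish_pending(conn, publish):
    """Publish every event that has no PubSub ID yet and record the returned ID."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE pubsub_id IS NULL"
    ).fetchall()
    sent = 0
    for event_id, payload in rows:
        try:
            message_id = publish(payload)
        except Exception:
            # Publish failed: pubsub_id stays NULL, so the next run retries it.
            continue
        # If this write fails after a successful publish, the row still looks
        # unsent and the event will be published a second time on the next run.
        conn.execute(
            "UPDATE events SET pubsub_id = ? WHERE id = ?", (message_id, event_id)
        )
        conn.commit()
        sent += 1
    return sent


def make_demo_db():
    """Build an in-memory table with two unsent events (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, pubsub_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO events (payload) VALUES (?)", [("signup",), ("purchase",)]
    )
    conn.commit()
    return conn
```

Running `publish_pending` a second time on the same table publishes nothing, because every row already carries a PubSub ID.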
But if the MySQL write fails after the PubSub publish succeeds, I may end up publishing the same message to PubSub again, so I need something on the DataFlow side to handle both this case and the case where PubSub delivers a message twice (according to https://cloud.google.com/pubsub/subscriber#guarantees ).
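In plain Python, the behaviour I want on the consuming side looks roughly like the sketch below. It is not DataFlow code; it only shows the intended result. It assumes each message carries a stable key of my own (here a hypothetical `event_id` field holding the MySQL primary key), since a re-published message would get a new PubSub ID and so the PubSub ID alone could not identify that kind of duplicate.

```python
def count_unique_events(messages):
    """Count each event once, keyed on a stable ID that I attach myself."""
    seen = set()
    count = 0
    for message in messages:
        key = message["event_id"]  # hypothetical field: the MySQL primary key
        if key in seen:
            # Either a re-publish (new PubSub ID) or a redelivery (same PubSub ID).
            continue
        seen.add(key)
        count += 1
    return count


# Example input: event 1 was published twice, event 2 was delivered twice.
messages = [
    {"event_id": 1, "pubsub_id": "a"},
    {"event_id": 1, "pubsub_id": "b"},
    {"event_id": 2, "pubsub_id": "c"},
    {"event_id": 2, "pubsub_id": "c"},
]
unique_count = count_unique_events(messages)
```

An in-memory set like this obviously would not work for an unbounded stream, which is part of what I am asking about.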
? AppEngine , , , , , DataFlow. , , ? DataFlow ?