How to ensure idempotence with DataFlow and Cloud PubSub?

I am wondering how best to achieve idempotence when using Cloud DataFlow and PubSub.

We currently have a system that processes and stores records in a MySQL database. I am interested in using DataFlow for some of our reports, but I would like to understand what I need to do so that I do not accidentally count the same message twice (or more).

My concern has two parts: first, making sure I send each message only once and, second, making sure I process each message only once.

My gut feeling is that it would look like this:

Whenever an event of interest to me is recorded in our MySQL database, convert it to a PubSub message and publish it to PubSub. Assuming success, store the PubSub message ID that was returned alongside the MySQL record. That way, if a record has a PubSub ID, I know that I sent it and do not need to send it again. If publishing to PubSub failed, the record has no ID and I know that I need to send it again. So far, so good.
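The publish-then-record step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the rows are plain dicts standing in for MySQL records, the `pubsub_id` field and the `publish_pending` helper are hypothetical names, and `publish` is any callable that returns a message ID or raises on failure. It is not the real PubSub client API.

```python
def publish_pending(rows, publish):
    """Publish every row that has no stored PubSub ID yet.

    rows:    list of dicts standing in for MySQL records (assumption)
    publish: callable(payload) -> message ID, raising on failure (assumption)
    """
    for row in rows:
        if row.get("pubsub_id") is not None:
            continue  # an ID is stored, so this row was already sent
        try:
            message_id = publish(row["payload"])
        except Exception:
            # Publishing failed: leave pubsub_id empty so a later run retries.
            continue
        # Stand-in for the MySQL UPDATE that stores the returned ID.
        # If this write were lost, the next run would publish the row again.
        row["pubsub_id"] = message_id
    return rows
```

Running the helper repeatedly is safe for rows that already carry an ID, but it cannot rule out a duplicate publish when the ID write itself is lost.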

But if the MySQL write fails after the PubSub publish succeeds, I may end up publishing the same message to PubSub again, so I need something on the DataFlow side to handle both this case and the case where PubSub delivers a message twice (according to https://cloud.google.com/pubsub/subscriber#guarantees ).
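To make the receiving-side requirement concrete, here is a minimal in-memory sketch of deduplicating on a stable identifier. It assumes each message carries a unique `event_id` attribute taken from the MySQL record; that attribute name, the message shape, and the `count_unique` helper are all hypothetical, and the code is plain Python rather than anything from the DataFlow SDK.

```python
def count_unique(messages, id_key="event_id"):
    """Count messages, ignoring any whose ID attribute was already seen.

    messages: iterable of dicts shaped like {"attributes": {...}, "data": ...}
              (an assumed shape for illustration)
    """
    seen = set()
    count = 0
    for message in messages:
        key = message["attributes"][id_key]
        if key in seen:
            continue  # republished or redelivered copy: skip it
        seen.add(key)
        count += 1
    return count
```

Because the key comes from the source record rather than from the PubSub message ID, both a republished message (new PubSub ID, same record) and a redelivered message (same PubSub ID) collapse to a single count.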

? AppEngine , , , , , DataFlow. , , ? DataFlow ?


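Dataflow can deduplicate PubSub messages by a record ID carried in a message attribute (see idLabel). If you attach a unique ID from the MySQL record as that attribute, Dataflow can recognize a republished copy of the same record as a duplicate.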


Source: https://habr.com/ru/post/1628746/

