How to deal with Kafka cluster failure

We are going to introduce the Kafka Publish Subscribe system.

Now, in the worst of the worst cases - if all the Kafka eyebrows for a given topic are omitted - what happens?

I tried this ... the publisher found it after the default timeout to retrieve metadata and throws an exception if it failed.

In this case, we can track the exception and restart Publisher after committing Kafka.

But what about consumers - they don't seem to get any exceptions as soon as Kafka comes down. We simply cannot ask "all" consumers to restart their systems. Any better way to solve this problem?

+5
source share
2 answers

But what about consumers - they don't seem to get any exceptions as soon as Kafka comes down. We simply cannot ask "all" consumers to restart their systems. Any better way to solve this problem?

Yes, the consumer will have no exceptions, and the behavior will work as designed. However, you do not need to restart all consumers, just make sure that in your logic the user regularly calls the poll() method. The consumer is designed in such a way that it does not work, even if there is no cluster. Consider the following steps to understand what will actually happen:

1: All clusters are down, no active cluster.

2: consumer.poll(timeout) // This will be called form you portion of code

3: Inside the poll() call to the KafkaConsumer.java method, the following sequence of calls will take place.

 poll() --> pollOnce() --> ensureCoordinatorKnown() --> awaitMetaDataUpdate() 

I highlighted the calls of the main methods that will be called after performing logical checks inside. Now, at this point, your consumer will wait until the cluster is back up.

4: Cluster again or restarted

5: The consumer will be notified and it will start working again, as usual, before the cluster leaves.

Note. - The consumer will begin to receive messages from the last commit offset, the received message will not be duplicated.

The described behavior is valid for version (version 0.9.x)

+4
source

If a consumer (version 0.9.x) polls and the cluster leaves, it should receive the following exception

 java.net.ConnectException: Connection refused 

You can continue polling until the cluster returns, there is no need to restart the user, he will restore the connection.

+2
source

Source: https://habr.com/ru/post/1244330/


All Articles