How to efficiently handle Kafka publishing failures

I use Kafka, and we have a requirement to build a fault-tolerant system in which not even a single message can be missed. So here is the problem: if publishing to Kafka fails for any reason (ZooKeeper down, Kafka broker down, etc.), how can we safely store these messages and replay them as soon as everything is restored? In addition, we must know at any given time how many messages could not be published to Kafka for any reason, that is, something like a counter function, and those messages then need to be re-published.

One solution is to push these messages to some database (for example, Cassandra, where writes are very fast, but we also need a counter feature, and I believe Cassandra's counters are not that useful, so we don't want to rely on them), something that can handle such a load and also provide us with a very accurate counter.

This question is more about the architecture perspective, and then which technology to use to make it happen.

PS: We process something like 3000 TPS, so if the system crashes, the backlog of failed messages can grow very large in a very short time. We use Java-based frameworks.

Thanks for your help!

+6
2 answers

The reason Kafka was built to be distributed and fault-tolerant is to handle problems exactly like yours: multiple failures of core components should not cause a service interruption. To guard against a ZooKeeper outage, deploy at least 3 ZooKeeper instances (if this is in AWS, deploy them across availability zones). To guard against broker failures, deploy multiple brokers and make sure you list several of them in your producer's bootstrap.servers property. To ensure that the Kafka cluster has written your message durably, set acks=all in the producer; this acknowledges the client's write only once all in-sync replicas have confirmed receipt of the message (at the cost of throughput). You can also set queueing limits so that if writes to the broker start backing up, you can catch the exception, handle it, and possibly retry.
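Below is a minimal sketch of such a producer setup with the Java client; the broker addresses, topic name, timeout value, and class name are placeholders for illustration, not part of the original answer:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DurableProducerConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            // List several brokers so the client can still bootstrap when one of them is down.
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
            // Wait for all in-sync replicas to acknowledge each write (durability at the cost of throughput).
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            // Bound how long send() may block when the local buffer backs up, so the caller
            // gets an exception it can handle instead of waiting indefinitely.
            props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "5000");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                try {
                    producer.send(new ProducerRecord<>("my-topic", "some-key", "some-value"));
                } catch (RuntimeException e) {
                    // Thrown, for example, when the buffer is full and max.block.ms elapses:
                    // this is the place to count the failure and keep the message for a later retry.
                }
            }
        }
    }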

Using Cassandra (another well-thought-out, distributed, fault-tolerant system) to "stage" your writes does not seem to add any reliability to your architecture, but it does add complexity; and since Cassandra was not written to be a message queue, I would avoid using it as one.

Properly configured, Kafka should be available to process all of your messages and provide the appropriate guarantees.

+4

Chris has already talked about how to maintain system resiliency.

Kafka supports at-least-once message delivery semantics by default, which means that if something goes wrong while a message is being sent, the producer will try to send it again.

When creating the Kafka producer properties, you can control this by setting the retries option to a value greater than 0.

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:4242");
    props.put("acks", "all");
    // A value greater than 0 lets the producer retry transient send failures.
    props.put("retries", 3);
    props.put("batch.size", 16384);
    props.put("linger.ms", 1);
    props.put("buffer.memory", 33554432);
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, String> producer = new KafkaProducer<>(props);
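As an illustrative follow-up (the class, topic, and counter below are a sketch, not part of the original answer): a send that still fails after the configured retries are exhausted surfaces as a non-null exception in the producer's completion callback, which is one way to keep the running count of unpublished messages the question asks for:

    import java.util.concurrent.atomic.AtomicLong;

    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CountingSender {

        // Running total of messages that could not be delivered even after retries.
        private final AtomicLong failedCount = new AtomicLong();

        public void send(Producer<String, String> producer, String topic, String key, String value) {
            producer.send(new ProducerRecord<>(topic, key, value), (metadata, exception) -> {
                if (exception != null) {
                    // Delivery failed (retries exhausted or a non-retriable error):
                    // count it and hand the message to whatever replay store you choose.
                    failedCount.incrementAndGet();
                }
            });
        }

        public long failedSoFar() {
            return failedCount.get();
        }
    }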


+2

Source: https://habr.com/ru/post/1011565/

