How to deduplicate records from AWS Kinesis Firehose to Redshift?

I read the official AWS Kinesis Firehose documentation, but it does not mention how to handle duplicate events. Does anyone have experience with this? I found people using ElastiCache for filtering, so does that mean I need to use AWS Lambda to encapsulate that filtering logic? Is there a simpler way, for example having Firehose deliver data into Redshift with "exactly once" semantics? Many thanks!

1 answer

You can have duplication on both sides of a Kinesis stream: the producer can put the same event into the stream twice, and consumers can read the same event twice.

The producer side can happen if you try to put an event into the Kinesis stream but, for some reason, you are not sure whether it was written successfully, so you decide to put it again. The consumer side can happen if you fetch a batch of events and start processing them, then fail before you manage to checkpoint your position; the next worker then picks up the same batch of events from the Kinesis stream, based on the last checkpointed sequence ID.
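To make the producer-side case concrete, here is a minimal sketch in Python with boto3 (the stream name, retry count, and `event_id` field are my own illustrative choices, not from the original post). The retry after an error is exactly what can create a duplicate, so the record carries a stable ID that consumers can later use to de-duplicate.

```python
import json
import uuid
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_event(payload: dict, stream_name: str = "my-stream") -> None:
    # Attach a stable ID so downstream consumers can de-duplicate,
    # even if this call is retried and the record lands twice.
    event = {"event_id": str(uuid.uuid4()), **payload}
    data = json.dumps(event).encode("utf-8")
    for attempt in range(3):
        try:
            kinesis.put_record(
                StreamName=stream_name,
                Data=data,
                PartitionKey=event["event_id"],
            )
            return
        except ClientError:
            # After a timeout or error we cannot know whether the record was
            # actually written; retrying here is what can produce a duplicate.
            if attempt == 2:
                raise
```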

Before you start solving this problem, evaluate how often such duplication actually happens and what its business impact is. Not every system handles financial transactions that cannot tolerate duplicates. Nevertheless, if you decide that you do need de-duplication, a common way to solve it is to use some event ID and track whether you have already processed that event ID.

ElastiCache with Redis is a good place to track your event IDs. Every time you pick up an event for processing, check whether it is already in Redis: if you find it, skip the event; if you don't, add it (with some TTL based on the possible time window for such duplication).
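A minimal sketch of that check with redis-py, assuming events carry an `event_id` field (endpoint, key prefix, and TTL are illustrative): `SET` with `NX` only succeeds if the key is new, so a failed `SET` means the event was already seen within the TTL window.

```python
import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379)

DEDUP_TTL_SECONDS = 24 * 3600  # assumed duplication window

def is_new_event(event_id: str) -> bool:
    # SET ... NX EX <ttl>: returns True only if the key did not exist,
    # so a falsy result means we have already processed this event.
    return bool(r.set(f"dedup:{event_id}", 1, nx=True, ex=DEDUP_TTL_SECONDS))
```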

If you choose Kinesis Firehose (instead of Kinesis Streams), you no longer control the consumer application, so you cannot implement this check there. You therefore either run the de-duplication logic on the producer side (before putting the event into Firehose), switch to Kinesis Streams and run your own consumer code in Lambda or with the KCL before loading into Redshift, or settle for the de-duplication you can do inside Redshift (see below).
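If you go the Kinesis Streams + Lambda route, the de-duplication can live in the Lambda handler, roughly like the sketch below. The `is_new_event` helper is the Redis check from the previous snippet; how you then load the surviving rows into Redshift (for example, staging to S3 and running COPY) is deliberately left out.

```python
import base64
import json

def handler(event, context):
    fresh_rows = []
    for record in event["Records"]:
        # Kinesis delivers the record payload base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Skip anything we have already seen within the dedup window.
        if not is_new_event(payload["event_id"]):
            continue
        fresh_rows.append(payload)
    # Load fresh_rows into Redshift here (e.g. stage to S3 and COPY);
    # that part is omitted from this sketch.
    return {"deduplicated": len(fresh_rows)}
```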

If you are not too sensitive to duplication, you can also handle it inside Redshift itself, using functions such as COUNT DISTINCT or LAST_VALUE in a WINDOW function.
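As an illustration of the window-function approach (table, column, and cluster names are made up), a query like the one below keeps one row per event_id, collapsing exact duplicates; here it is submitted through the Redshift Data API, but any client works.

```python
import boto3

# Illustrative de-duplication query: take the latest payload per event_id and
# collapse exact duplicates with DISTINCT.
DEDUP_SQL = """
SELECT DISTINCT
       event_id,
       LAST_VALUE(payload) OVER (
           PARTITION BY event_id
           ORDER BY arrival_time
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS payload
FROM events_raw;
"""

client = boto3.client("redshift-data")
client.execute_statement(
    ClusterIdentifier="my-cluster",  # assumed cluster name
    Database="dev",
    DbUser="admin",
    Sql=DEDUP_SQL,
)
```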



