Partitioning Kinesis Firehose S3 records by event time

Firehose → S3 uses the current date as a prefix when creating keys in S3, so the data is partitioned by the time it was written. My Firehose stream contains events that carry a specific event time.
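
For reference, the keys Firehose produces look roughly like the sketch below. The concrete key is made up, and the parsing assumes a prefix with no slashes in it:

```python
from datetime import datetime

# Default Firehose object key layout (sketch; the exact object-name
# suffix varies):
#   <prefix>/YYYY/MM/DD/HH/<stream-name>-<version>-<timestamp>-<random>
# The date components are the UTC delivery (arrival) time, not anything
# taken from the records themselves. The key below is a made-up example.
key = "my-prefix/2017/08/01/13/my-stream-1-2017-08-01-13-42-10-abcd1234"

year, month, day, hour = key.split("/")[1:5]
print(datetime(int(year), int(month), int(day), int(hour)))  # 2017-08-01 13:00:00
```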

Is there any way to create S3 keys containing this event time instead? The downstream processing tools depend on each event being in an “hourly directory” corresponding to when it actually happened. Or does this have to be an additional processing step after Firehose is done?

The event time could be carried in the partition key, or I could use a Lambda function to parse it out of the record.
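
For illustration, here is a minimal sketch of what such parsing could look like, assuming JSON records with a hypothetical epoch-seconds `event_time` field. (Note that a Firehose transformation Lambda can rewrite record contents but not the S3 key, so this alone would not fix the prefix.)

```python
import base64
import json
from datetime import datetime, timezone

def extract_event_time(firehose_record):
    """Extract the event time from one record of a Firehose
    transformation event. Assumes the payload is a JSON object with a
    hypothetical epoch-seconds "event_time" field; adjust to your schema.
    """
    payload = json.loads(base64.b64decode(firehose_record["data"]))
    return datetime.fromtimestamp(payload["event_time"], tz=timezone.utc)
```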

+5
2 answers

Kinesis Firehose does not (yet) allow customers to control how the date suffix is generated for S3 targets.

The only option available to you is to add a post-processing layer after Kinesis Firehose. For example, you could schedule an hourly EMR job using Data Pipeline that reads all files written in the last hour and publishes them to the correct S3 destinations.
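
If an hourly EMR job is more machinery than you need, the same post-processing idea can also be sketched as an S3-triggered Lambda. Everything below (the `event_time` field, newline-delimited JSON, the destination key layout) is an assumption about your data, not Firehose behavior:

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Triggered by S3 ObjectCreated notifications on the Firehose bucket.
    for notification in event["Records"]:
        bucket = notification["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(notification["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()

        # Group records by the hour in which they actually happened.
        by_hour = {}
        for line in body.splitlines():
            record = json.loads(line)
            hour = record["event_time"][:13]  # e.g. "2017-08-01T13"
            by_hour.setdefault(hour, []).append(line)

        # Write each group under an event-time-based key that downstream
        # tools can watch; this layout is just a suggestion.
        for hour, lines in by_hour.items():
            dest_key = f"by-event-time/{hour.replace('T', '/')}/{key.rsplit('/', 1)[-1]}"
            s3.put_object(Bucket=bucket, Key=dest_key, Body="\n".join(lines))
```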

+3

This is not an answer to the question, but I would like to explain the idea behind storing records according to the event arrival time.

First, a few words about streams. Kinesis is just a data stream, and it has the concept of consumption: you can reliably consume a stream only by reading it sequentially. There is also the idea of checkpoints as a mechanism to suspend and resume the consumption process. A checkpoint is simply a sequence number that identifies a position in the stream; by specifying this number, you can start reading the stream from that specific event.
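
To make the checkpoint idea concrete, here is a minimal sketch with boto3; the stream name, shard id, and sequence number are placeholders:

```python
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical checkpoint: the sequence number of the last record we
# successfully processed, loaded from wherever it was persisted.
checkpoint = "49546986683135544286507457936321625675700192471156785154"

iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="AFTER_SEQUENCE_NUMBER",  # resume just past the checkpoint
    StartingSequenceNumber=checkpoint,
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    # Stand-in for a real handler; persist record["SequenceNumber"]
    # as the new checkpoint after processing.
    print(record["SequenceNumber"], record["Data"])
```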

And back to the default Firehose → S3 setup... Since the capacity of a Kinesis stream is quite limited, you most likely need to store the data from Kinesis somewhere in order to analyze it later, and a Firehose delivery to S3 does exactly that out of the box: it simply stores the raw data from the stream into S3 buckets. But logically this data still represents the same stream of records, and in order to reliably consume (read) this stream, we need those sequence numbers for checkpoints. And those numbers are the arrival times of the records.

What if I want to read records by creation time? It seems the right way to accomplish this is to read the S3 "stream" sequentially, load it into some database or time-series store, and then query records by creation time against that storage. Otherwise there will always be a non-zero chance of missing a few batches of events when reading S3 (the stream). So I would not suggest reordering the S3 buckets at all.
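
A minimal sketch of that approach, with SQLite standing in for the database or time-series store; the bucket, prefix, newline-delimited-JSON layout, and `event_time` field are all assumptions:

```python
import json
import sqlite3

import boto3

s3 = boto3.client("s3")
db = sqlite3.connect("events.db")
db.execute("CREATE TABLE IF NOT EXISTS events (event_time TEXT, payload TEXT)")

# Read the S3 "stream" sequentially: keys list in arrival order thanks
# to the date prefix. Bucket and prefix names are placeholders.
for page in s3.get_paginator("list_objects_v2").paginate(
    Bucket="my-firehose-bucket", Prefix="my-prefix/"
):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket="my-firehose-bucket", Key=obj["Key"])["Body"].read()
        for line in body.decode().splitlines():
            db.execute(
                "INSERT INTO events VALUES (?, ?)",
                (json.loads(line)["event_time"], line),
            )
db.commit()

# Downstream reads can now be done by event/creation time, not arrival time:
rows = db.execute(
    "SELECT payload FROM events WHERE event_time BETWEEN ? AND ? ORDER BY event_time",
    ("2017-08-01T13:00:00", "2017-08-01T14:00:00"),
).fetchall()
```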

0

Source: https://habr.com/ru/post/1264001/

