Kinesis Shard Inetrator Explanation - Java SDK

Ok, I will start with the developed use case and explain my question:

  • I use a third-party web analytics platform that uses Kinesis AWS streams to transfer data from the client to the final destination - Kinesis stream;
  • The web analytics platform uses 2 threads:
    • data collector stream (single fragment stream);
    • The second stream to enrich the raw data from the collector stream (single fragmentation stream); Most importantly, this stream consumes raw data from the first stream using the TRIM_HORIZON TRIM_HORIZON ;
  • I use data from a stream using the AWS Java SDK , in secret using the GetShardIteratorRequest class;
  • I am currently developing a retrieval class, so this is done synchronously, that is, I only use data when compiling my class;
  • The class works unexpectedly, although there are some things that I don’t understand, especially regarding how the data is consumed from the stream and the values ​​of each of the iterator types;

My problem is that the data I retrieve is incompatible and does not have chronological logic in it.

  • When I use AT_SEQUENCE_NUMBER and provide the first sequence number from the shard with

    .getSequenceNumberRange () getStartingSequenceNumber () ;.

    ... like,, I do not get all the records. Similarly, AFTER_SEQUENCE_NUMBER ;

  • When I use LATEST , I get null results;
  • When I use TRIM_HORIZON , which should make sense to use, it does not seem to work fine. He used to provide me with data, and then added new β€œevents” (writes to the final stream), and I got zero records. Secret.

My questions:

  • How can I safely use data from a stream without worrying about missing records?
  • Is there an alternative to ShardIteratorRequest ?
  • If there is, how can I just β€œbrowse” the stream and see what's inside it to debug the links?
  • What am I missing with the TRIM_HORIZON method?

Thanks in advance, I would love to know a little more about consuming data from the Kinesis stream.

+6
source share
1 answer

I understand the confusion above, and I had the same problems, but I think I understood it now. Please note that I use the JSON API directly without KCL.

The API seems to give clients 2 basic options for iterators when they start consuming a stream:

A) TRIM_HORIZON: for reading PAST records delayed between many minutes (even hours) and 24 hours. It does not return recently set records. Using AFTER_SEQUENCE_NUMBER in the last record this iterator sees returns an empty array even when the records were recently PUT.

B) RECENT: for reading FUTURE records in real time (immediately after PUT). I was deceived by the only sentence of documentation that I could find on this "Start reading immediately after the most recent entry in the shard so that you always read the latest data in the shard." You were getting an empty array because no records have been PUT since you received the iterator. If you get this type of iterator and then start the record, this record will be immediately available.

Finally, if you know the sequence identifier of a recently set record, you can retrieve it immediately using AT_SEQUENCE_NUMBER, and you can retrieve later records using AFTER_SEQUENCE_NUMBER, even if they will not be displayed to the TRIM_HORIZON iterator.

The foregoing means that if you want to read all known past records and future records in real time, you need to use a combination of A and B, with logic, to handle the records between them (recent past). KCL can smooth this out well.

0
source

Source: https://habr.com/ru/post/975423/


All Articles