AWS DynamoDB v2: Do I need a secondary index for alternative queries?

I need to create a table that will hold data produced by a continuously running process. This process generates messages that contain, among other things, two required components: a globally unique message UUID and a message timestamp.

These messages will later be retrieved by their UUID.

In addition, on a regular basis I will need to delete all messages from the table that are too old, i.e. whose timestamps are more than X older than the current time.

I read the DynamoDB v2 documentation (e.g. Local Secondary Indexes), trying to figure out how to organize my table and whether I need an additional index to find the messages that should be deleted. There may be a simple answer to my question, but I'm somewhat confused...

So should I just create a table with the UUID as the hash key and messageTimestamp as the range key (along with a message attribute holding the actual message), and not create any secondary indexes? In the examples I have seen, the hash was something non-unique (for example, ForumName at the link above). In my case, however, the hash will be unique. I'm not sure whether that matters.

And if I create a table with that hash and range key, and no secondary index, how would I query for all messages that fall in a specific time range, regardless of their UUID?
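
For concreteness, this is roughly the table I have in mind, sketched with Python's boto3 client (the table and attribute names are just illustrative):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Proposed schema: message UUID as hash key, timestamp as range key.
dynamodb.create_table(
    TableName="Messages",  # illustrative name
    AttributeDefinitions=[
        {"AttributeName": "MessageUUID", "AttributeType": "S"},
        {"AttributeName": "MessageTimestamp", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "MessageUUID", "KeyType": "HASH"},
        {"AttributeName": "MessageTimestamp", "KeyType": "RANGE"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)

# Retrieving by UUID alone would be a Query on the hash key:
resp = dynamodb.query(
    TableName="Messages",
    KeyConditionExpression="MessageUUID = :u",
    ExpressionAttributeValues={":u": {"S": "some-message-uuid"}},
)
```

Since the UUID is globally unique, each hash key would map to exactly one item, so the query above returns at most one message.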

3 answers

We struggled with this as well. The best solution we came up with is to create a second table to store the time-series data. To do this:

1) Use a date-plus-id "bucket" for the hash key
You could use just the date, but then I assume today's date would become a hot key, one that is written to far more often than the others. That can create a serious bottleneck, because the throughput available to a single DynamoDB partition equals the table's total provisioned throughput divided by the number of partitions. This means that if all your writes go to a single key (today's key) and you have provisioned 20 writes per second over 20 partitions, your effective throughput will be 1 write per second. Any requests beyond that will be throttled. Not a good situation.

The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying database. Determining n is a bit tricky, because Dynamo does not expose how many partitions it uses, but we are currently working with an upper limit of 200, based on a write-up we found; that write-up was the basis of our thinking in developing this approach.

2) Use the UUID for the range key

3) When reading, issue queries for each day and bucket (see the sketch just below). This may seem tedious, but it is more efficient than a full table scan. Another possibility is an Elastic MapReduce job, but I have not tried that yet, so I can't say how easy or efficient it is.
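
A rough sketch of the pattern in Python with boto3 (the table name, attribute names, and the choice of n = 200 are ours and purely illustrative):

```python
import random
import uuid
from datetime import datetime, timezone

import boto3

dynamodb = boto3.client("dynamodb")
N_BUCKETS = 200  # upper bound discussed above; tune for your workload

def put_message(body: str) -> None:
    """Write a message under a date-plus-bucket hash key to spread load."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    bucket = random.randint(1, N_BUCKETS)
    dynamodb.put_item(
        TableName="MessagesByDay",  # hypothetical second table
        Item={
            "DayBucket": {"S": f"{day}#{bucket}"},    # hash key
            "MessageUUID": {"S": str(uuid.uuid4())},  # range key
            "Body": {"S": body},
        },
    )

def messages_for_day(day: str) -> list:
    """Read a whole day back by querying every bucket for that date."""
    items = []
    for bucket in range(1, N_BUCKETS + 1):
        # Pagination via LastEvaluatedKey omitted for brevity.
        resp = dynamodb.query(
            TableName="MessagesByDay",
            KeyConditionExpression="DayBucket = :db",
            ExpressionAttributeValues={":db": {"S": f"{day}#{bucket}"}},
        )
        items.extend(resp["Items"])
    return items
```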

We are still figuring this out, so I'm interested in hearing others' comments. I also found this presentation very helpful for thinking about how best to use Dynamo: Falling in and out of love with Dynamo.

-John


DynamoDB has since introduced Global Secondary Indexes, which solve this problem. http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
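
For example, a GSI whose hash key is the message's day and whose range key is the timestamp supports time-range queries without knowing any UUIDs. A sketch with boto3, assuming the schema from the question plus a MessageDay attribute on every item (all names are illustrative):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical index: query a day's messages by time range.
dynamodb.update_table(
    TableName="Messages",
    AttributeDefinitions=[
        {"AttributeName": "MessageDay", "AttributeType": "S"},
        {"AttributeName": "MessageTimestamp", "AttributeType": "N"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "ByDayAndTime",
            "KeySchema": [
                {"AttributeName": "MessageDay", "KeyType": "HASH"},
                {"AttributeName": "MessageTimestamp", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 5,
                "WriteCapacityUnits": 5,
            },
        },
    }],
)

# Everything from a given day older than a cutoff:
resp = dynamodb.query(
    TableName="Messages",
    IndexName="ByDayAndTime",
    KeyConditionExpression="MessageDay = :d AND MessageTimestamp < :c",
    ExpressionAttributeValues={
        ":d": {"S": "2013-07-01"},
        ":c": {"N": "1372680000"},
    },
)
```

Note that an index keyed on the day alone inherits the hot-key concern described in the first answer; the same bucketing trick can be applied to the index hash key.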


In short, you can't. Every DynamoDB query MUST specify the primary hash key. Optionally, you can also use the range key and/or a local secondary index. With DynamoDB's current functionality, you cannot use an LSI as an alternative to the primary hash key, and you cannot issue a query using the range key alone (you can easily verify this in the AWS console).

The (expensive) workaround I can think of is to issue a full table Scan, adding a filter on the timestamp value to determine which items to remove. Note that the filter does not reduce the read capacity consumed, because the scan still reads the entire table.
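
A sketch of that workaround with boto3, reusing the illustrative table and attribute names from above:

```python
import boto3

dynamodb = boto3.client("dynamodb")

def delete_older_than(cutoff_epoch: int) -> None:
    """Scan the whole table, filter on timestamp, delete the matches.

    The filter is applied after items are read, so the scan still
    consumes read capacity for every item in the table.
    """
    paginator = dynamodb.get_paginator("scan")
    pages = paginator.paginate(
        TableName="Messages",
        FilterExpression="MessageTimestamp < :c",
        ExpressionAttributeValues={":c": {"N": str(cutoff_epoch)}},
        ProjectionExpression="MessageUUID, MessageTimestamp",
    )
    for page in pages:
        for item in page["Items"]:
            dynamodb.delete_item(
                TableName="Messages",
                Key={
                    "MessageUUID": item["MessageUUID"],
                    "MessageTimestamp": item["MessageTimestamp"],
                },
            )
```

Batching the deletes with batch_write_item would cut round trips, but the dominant cost remains reading every item during the scan.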

