How to partition Azure tables used for log storage

We recently updated our logging to use Azure Table Storage, which, thanks to its low cost and high performance when querying by row and partition key, is very well suited to this purpose.

We try to follow the recommendations in Developing a Scalable Partitioning Strategy for Azure Table Storage. Because we perform a large number of inserts into this table (and, hopefully, an increasing number as we scale), we must make sure we do not hit our limits and lose logs. We structured our design as follows:

  • We have an Azure account for each environment (DEV, TEST, PROD).

  • We have a table for each product.

  • We use TicksReversed + GUID for the Row Key so that we can query blocks of results between specific times with high performance (a sketch of this key scheme follows the list below).

  • We initially decided to partition the table by Logger, which for us meant broad areas of the product, such as API, Application, Performance and Caching. However, due to the low number of partitions, we found that this led to so-called "hot" partitions, where many inserts were performed against a single partition in a given period of time. So we changed the partition key to Context (for us, the class name or API resource).
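
For illustration, here is a minimal sketch of the key scheme described above: Context as the PartitionKey and reversed ticks plus a GUID as the RowKey. It assumes the classic WindowsAzure.Storage table SDK, and the entity shape and property names are illustrative, not the actual code.

```csharp
// Minimal sketch of the key scheme above (classic WindowsAzure.Storage SDK).
// The entity shape and property names are illustrative assumptions.
using System;
using Microsoft.WindowsAzure.Storage.Table;

public class LogEntity : TableEntity
{
    public string Message { get; set; }
    public string Level { get; set; }

    public LogEntity() { }   // parameterless constructor required by the table SDK

    public LogEntity(string context, DateTime timestampUtc, string message, string level)
    {
        // Partition by Context (class name or API resource).
        PartitionKey = context;
        // Reversed ticks make newer rows sort first lexicographically;
        // the GUID suffix keeps row keys unique within the same tick.
        long ticksReversed = DateTime.MaxValue.Ticks - timestampUtc.Ticks;
        RowKey = $"{ticksReversed:D19}_{Guid.NewGuid():N}";
        Message = message;
        Level = level;
    }
}
```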

However, in practice we have found that this is not ideal either, because when we look at our logs at a glance, we would like them to appear in time order. Instead, we end up with blocks of results grouped by context, and we have to fetch all the partitions if we want to order them by time.

Some of the ideas we had were:

  • use time blocks (e.g. 1 hour) as partition keys so results are ordered by time (this produces a hot partition for an hour at a time)

  • use a few random GUIDs as partition keys to try to spread the logs out (we lose the ability to quickly query on attributes such as Context).

Since this is such a common application of Azure Table Storage, there must be some standard approach. What is the best practice for partitioning Azure tables used to store logs?

Solution Constraints

  • Use cheap Azure storage (table storage seems obvious)

  • Fast, scalable writes

  • Low probability of losing logs (i.e. of exceeding the write limit of 2,000 entities per second per partition in Azure Table Storage).

  • Reading is ordered by date, most recent first.

  • If possible, partition on something that would be useful to query on (e.g. product area).

+6
4 answers

I came across a situation similar to the one you are encountering; based on my experience, here is what I can say:

Whenever a query is run against an Azure storage table, it performs a full table scan if no partition key is specified. In other words, the table is indexed by partition key, and partitioning the data properly is the key to getting results quickly.

You now have to think about which queries you will fire against the table, such as logs that occurred over a period of time, logs for a product, and so on.

One way is to use reversed ticks rounded to the nearest hour, rather than exact ticks, as part of the partition key. That way you can query data hour by hour based on the partition key. Depending on how many rows fall into each partition, you can change the precision to a day. It is also advisable to store related data together, meaning the data for each product goes into a separate table. This way you reduce the number of partitions and the number of rows in each partition.
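
A minimal sketch of such an hour-granularity key, assuming UTC timestamps; swap the truncation to a day if partitions grow too large:

```csharp
// Sketch: reversed ticks truncated to the hour, usable as a PartitionKey.
// Rows for the same hour land in the same partition; newer hours sort first.
using System;

static string HourPartitionKey(DateTime timestampUtc)
{
    DateTime hour = new DateTime(timestampUtc.Year, timestampUtc.Month, timestampUtc.Day,
                                 timestampUtc.Hour, 0, 0, DateTimeKind.Utc);
    long ticksReversed = DateTime.MaxValue.Ticks - hour.Ticks;
    return ticksReversed.ToString("D19");
}
```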

Basically, make sure that you know the partition keys (exact values or a range) in advance and issue queries with those partition keys in order to get results faster.

To speed up writes to a table, you can use batch operations. Be careful though: if one entity in a batch fails, the whole batch fails. Proper retry and error handling can save you here.
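
A sketch of such a batched insert with the classic SDK (each batch must share a PartitionKey and hold at most 100 entities); the retry policy is left to the caller:

```csharp
// Sketch: group entities by PartitionKey, then insert them 100 at a time.
// If any entity in a batch fails, the whole batch fails, so wrap ExecuteBatch
// in your own retry/error handling.
using System.Collections.Generic;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Table;

static void InsertInBatches(CloudTable table, IEnumerable<ITableEntity> entities)
{
    foreach (var partition in entities.GroupBy(e => e.PartitionKey))
    {
        var rows = partition.ToList();
        for (int i = 0; i < rows.Count; i += 100)
        {
            var batch = new TableBatchOperation();
            foreach (var row in rows.Skip(i).Take(100))
                batch.Insert(row);
            table.ExecuteBatch(batch);   // retry on StorageException as appropriate
        }
    }
}
```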

On a side note, you can use blob storage to store a lot of related data. The idea is to store a chunk of related, serialized data as a single blob. You can hit one such blob to get all the data in it and do further projections on the client side. For example, an hour's worth of data for a product goes into one blob; you can devise a specific blob-prefix naming pattern and, when needed, hit the exact blob. This helps you retrieve your data fairly quickly, rather than performing a table scan for each query.

I have used the blob approach for several years without any problems. I convert my collection to IList<IDictionary<string,string>> and use binary serialization and GZip to store each blob. I use Reflection.Emit-based helper methods for quick access to object properties, so serialization and deserialization do not strain the CPU or memory.
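
The answer's actual code isn't shown, but a rough sketch of the idea, assuming the classic blob SDK, BinaryFormatter plus GZip, and an invented product/hour naming pattern, might look like this:

```csharp
// Sketch: serialize an hour's worth of rows, gzip them, and upload them as one blob
// whose name encodes product and hour so the exact chunk can be fetched later.
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Runtime.Serialization.Formatters.Binary;
using Microsoft.WindowsAzure.Storage.Blob;

static void UploadHourlyChunk(CloudBlobContainer container, string product,
                              DateTime hourUtc, List<Dictionary<string, string>> rows)
{
    // Prefix naming pattern, e.g. "myproduct/2015061114" for one product-hour.
    string blobName = $"{product}/{hourUtc:yyyyMMddHH}";

    using (var buffer = new MemoryStream())
    {
        using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
        {
            new BinaryFormatter().Serialize(gzip, rows);
        }
        buffer.Position = 0;
        container.GetBlockBlobReference(blobName).UploadFromStream(buffer);
    }
}
```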

Storing data in blobs helps me store more in less time and get my data faster.

+5

There is a very common trick for avoiding hot spots when writing, at the cost of making reads a little more expensive.

Define N partitions (e.g. 10 or so). When writing, put the row in a randomly chosen partition. Within each partition the rows can be sorted by time.

When reading, you need to read all N partitions (possibly filtered and ordered by time) and merge the query results.
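
A minimal sketch of that write/read pattern with the classic SDK; N, the partition-key prefix and the use of DynamicTableEntity are assumptions:

```csharp
// Sketch: writes pick one of N partitions at random; reads fan out over all N
// partitions and merge client-side by RowKey (reversed ticks, so newest first).
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Table;

static class SprayedLog
{
    const int N = 10;
    static readonly Random Rng = new Random();

    public static void Write(CloudTable table, DynamicTableEntity row)
    {
        row.PartitionKey = "p-" + Rng.Next(N);          // arbitrary partition
        table.Execute(TableOperation.Insert(row));
    }

    public static IEnumerable<DynamicTableEntity> Read(CloudTable table,
                                                       string fromRowKey, string toRowKey)
    {
        var results = new List<DynamicTableEntity>();
        for (int i = 0; i < N; i++)                     // could also run in parallel
        {
            string filter = $"(PartitionKey eq 'p-{i}') " +
                            $"and (RowKey ge '{fromRowKey}') and (RowKey le '{toRowKey}')";
            results.AddRange(table.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter)));
        }
        return results.OrderBy(e => e.RowKey);          // merge into one time-ordered list
    }
}
```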

This multiplies write scalability by N, and increases the cost of each read by the same factor in round trips and queries.

You might also consider storing the logs somewhere else entirely. The very tight artificial limits on Azure products create engineering costs that you would not otherwise have.

Choose N higher than the minimum needed to reach the account limit of 20,000 operations per second, so that random hot spots are unlikely; at 2,000 writes per second per partition, that minimum is 10 partitions. Choosing N at twice the minimum should be sufficient.

+3

If I read the question correctly, here are the constraints on the solution:

  • Use table storage
  • High write volume
  • Partitioned by product area
  • Automatically sorted by time

Some good solutions have already been presented, but I don’t think there is an answer that perfectly satisfies all the constraints.

The solution that seems closest to your constraints was provided by usr. Split the product-area partitions into N, but do not use a GUID, just use a number (e.g. ProductArea-5). Using a GUID makes the query problem much harder. If you use a number, you can query all the partitions for a product area in a single request, or even in parallel. Then keep using TicksReversed + GUID for the RowKey.

Single query: PartitionKey ge 'ProductArea' and PartitionKey le 'ProductArea-~' and RowKey ge 'StartDateTimeReverseTicks' and RowKey le 'EndDateTimeReverseTicks'

Parallel queries: PartitionKey eq 'ProductArea-1' and RowKey ge 'StartDateTimeReverseTicks' and RowKey le 'EndDateTimeReverseTicks' ... PartitionKey eq 'ProductArea-N' and RowKey ge 'StartDateTimeReverseTicks' and RowKey le 'EndDateTimeReverseTicks'

This solution does not satisfy "automatically ordered by time", but you can sort by RowKey on the client side to see the rows in order. If sorting on the client side is acceptable, then this solution should satisfy the other constraints.
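
A hedged sketch of the single range query plus client-side ordering described above, assuming the classic SDK and the ProductArea-N naming shown:

```csharp
// Sketch: one range query spans ProductArea-1..ProductArea-N ('~' sorts after the digits),
// then rows are ordered client-side by RowKey (reversed ticks), i.e. newest first.
using System.Collections.Generic;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Table;

static IEnumerable<DynamicTableEntity> QueryProductArea(CloudTable table, string productArea,
                                                        string startTicksReversed, string endTicksReversed)
{
    string filter = $"(PartitionKey ge '{productArea}') and (PartitionKey le '{productArea}-~') " +
                    $"and (RowKey ge '{startTicksReversed}') and (RowKey le '{endTicksReversed}')";

    return table.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter))
                .OrderBy(e => e.RowKey);
}
```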

+2

Not a very specific answer to your question, but here are some of my thoughts:

Essentially, you need to think about how you are going to query your data and design your storage / partitioning strategy based on that (keeping the partitioning-strategy guidance in mind). For instance:

  • If you need to look at the logs for all loggers within a given date/time range, then your current approach may not be suitable, because you would need to query several partitions in parallel.
  • Your current approach will work if you want to query a specific logger within a given date/time range.
  • Another suggestion is to use a combination of blob storage and table storage. If some data does not need to be queried frequently, you can simply push that data into blob storage (think of old logs: you really don't need to keep them in tables unless you query them frequently). Whenever you need such data, you can extract it from blob storage, push it back into table storage, and run your ad-hoc queries against that data.

Possible Solution

One possible solution would be to store multiple copies of the same data and use each copy appropriately. Since storage is cheap, you can keep two copies of the same data. In the first copy you could have PK = Date/Time and RK = whatever you decide, and in the second copy you could have PK = Logger and RK = TicksReversed + GUID. Then, when you want all the logs regardless of logger, you simply query the first copy (PK = Date/Time), and when you want the logs for a certain type of logger, you query the second copy (PK = Logger, RK >= Date/Time Start and RK <= Date/Time End).
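
A minimal sketch of the two-copies write path; the table names, the hour-bucket format for the date/time partition key, and the property names are assumptions:

```csharp
// Sketch: every log row is written twice, once keyed for time-ordered reads across
// all loggers and once keyed per logger for logger-specific time-range queries.
using System;
using Microsoft.WindowsAzure.Storage.Table;

static void WriteBothCopies(CloudTable byTimeTable, CloudTable byLoggerTable,
                            string logger, DateTime timestampUtc, string message)
{
    string ticksReversed = (DateTime.MaxValue.Ticks - timestampUtc.Ticks).ToString("D19");
    string rowKey = $"{ticksReversed}_{Guid.NewGuid():N}";

    // Copy 1: PK = date/time bucket, RK = reversed ticks + GUID (all logs in time order).
    var byTime = new DynamicTableEntity(timestampUtc.ToString("yyyyMMddHH"), rowKey);
    byTime.Properties["Logger"] = EntityProperty.GeneratePropertyForString(logger);
    byTime.Properties["Message"] = EntityProperty.GeneratePropertyForString(message);
    byTimeTable.Execute(TableOperation.Insert(byTime));

    // Copy 2: PK = logger, RK = reversed ticks + GUID (one logger over a time range).
    var byLogger = new DynamicTableEntity(logger, rowKey);
    byLogger.Properties["Message"] = EntityProperty.GeneratePropertyForString(message);
    byLoggerTable.Execute(TableOperation.Insert(byLogger));
}
```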

You can also find this link useful: http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/

+1

Source: https://habr.com/ru/post/985851/
