Redis as a reverse-lookup counter cache for MySQL

I have a very high-throughput site for which I am trying to store a "view count" for each page in a MySQL database (for legacy reasons the counts ultimately have to end up in MySQL).

The sheer number of views makes it impractical to run "UPDATE ITEM SET VIEW_COUNT = VIEW_COUNT + 1" on every hit. There are millions of items; most of them are viewed only a few times, while others are viewed many times.

So I am considering using Redis to accumulate the view counts, with a background thread that writes the counters to MySQL. What is the recommended way to do this? Some open questions about the approach:

  • How often should the background thread run?
  • How does it determine what to write to MySQL?
  • Should I keep a Redis key for every ITEM seen?
  • Which TTL should I use?
  • Is there a ready-made solution or PowerPoint presentation that gets me halfway there, etc.?

I saw very similar questions on StackOverflow, but none with a great answer ... alas! Hoping there is more Redis knowledge around by now.

+6
3 answers

I think you need to step back and look at some of your questions from a different angle in order to get answers.

"how often is the background thread running?" To answer this, you need to answer these questions: how much data can you lose? What is the reason that the data is in MySQL, and how often is this data available? For example, if a database is required only once a day to receive a report, you may need to update it only once a day. On the other hand, what if a Redis instance dies? How many increments can you lose and still be "good"? They will provide answers to the question of how to update your MySQL instance frequently, and we cannot answer for you.

I would use a rather different strategy for storing this in Redis. For the sake of discussion, let's assume you decide you need to "flush to the DB" every hour.

Store every hit in hashes, using key names along these lines:

interval_counter:DD:HH
interval_counter:total

Use the page identifier (for example, the MD5 sum of the URI, the URI itself, or whatever identifier you currently use) as the hash key, and do two increments per page view, one against each hash. This gives you a current total for each page and a subset of pages to update.
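
A minimal sketch of those two increments in Python with redis-py (the record_hit helper and page_id parameter are illustrative, not something from this answer):

    import datetime
    import redis

    r = redis.Redis()  # assumes a local Redis instance

    def record_hit(page_id):
        """Count one page view against the interval hash and the running total."""
        interval_key = datetime.datetime.utcnow().strftime("interval_counter:%d:%H")
        pipe = r.pipeline()
        pipe.hincrby(interval_key, page_id, 1)               # this interval's delta
        pipe.hincrby("interval_counter:total", page_id, 1)   # all-time total
        pipe.execute()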

After that, you have a cron job that runs a minute or so after the start of the hour and dumps all pages with updated view counts by grabbing the previous hour's hash. This gives you a very fast means of getting the data to update your MySQL database with, while avoiding any need to do maths or play tricks with timestamps, etc. By pulling the data from a key that is no longer being incremented you avoid race conditions due to clock skew.

You could set an expiration on the hourly key, but I would rather let the cron job delete it once it has successfully updated the database. That way your data is still there if the cron job fails or is disabled. It also gives the front end a full set of known counter data via keys that do not change. If you wanted to, you could even keep the data per day to get a window on how popular a page is. For example, if you kept a daily hash for 7 days by having the cron job set an expiration instead of deleting, you could show how much traffic each page had per day over the last week.
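
A sketch of such a flush job, assuming redis-py and mysql.connector; the connection details and the ID column of the ITEM table are placeholders, not from the answer:

    import datetime
    import mysql.connector
    import redis

    r = redis.Redis(decode_responses=True)

    def flush_previous_hour():
        """Apply the previous hour's deltas to MySQL, then delete the hash on success."""
        prev = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
        key = prev.strftime("interval_counter:%d:%H")
        counts = r.hgetall(key)        # {page_id: delta}; this key is no longer being incremented
        if not counts:
            return
        db = mysql.connector.connect(user="app", password="...", database="site")
        cur = db.cursor()
        for page_id, delta in counts.items():
            cur.execute(
                "UPDATE ITEM SET VIEW_COUNT = VIEW_COUNT + %s WHERE ID = %s",
                (int(delta), page_id),
            )
        db.commit()
        r.delete(key)                  # remove the hash only once MySQL has the data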

The two HINCRBY operations can be done individually or pipelined; either way they perform quite well, and more efficiently than doing the calculations and massaging the data in code.

Now for the question of expiring low-traffic pages, and memory usage. First, your data set does not sound like one that will require huge amounts of memory. Of course, much depends on how you identify each page. If you have a numeric identifier, the memory requirements will be quite small. If you still wind up using too much memory, you can tune that via the configuration and, if needed, could even use a 32-bit Redis build to cut memory usage significantly. For example, the data I describe in this answer I used to manage for one of the ten busiest forums on the Internet, and it consumed less than 3 GB of data. I also stored the counters in far more "time window" keys than I am describing here.

That said, in this use case Redis is a cache. If you are still using too much memory after the options above, you can set an expiration on the keys and add an expire command to each hit. More specifically, if you follow the pattern above, for each hit you will be doing the following:

 hincr -> total
 hincr -> daily
 expire -> total

This allows you to keep everything that is actively used fresh by extending its expiry on every access. Of course, for this you will need to wrap your display call to catch a null reply from HGET against the totals hash, populate it from the MySQL database, and then increment. You could even do both as an increment. This preserves the existing structure and will likely be the same code base you would need to refresh the Redis server from the MySQL DB if you ever have to repopulate a Redis node. For that you will need to consider and decide which data source is treated as authoritative.
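
A rough sketch of that read path; fetch_count_from_mysql is a hypothetical helper standing in for the lookup against whichever source you decide is authoritative:

    import redis

    r = redis.Redis(decode_responses=True)

    def get_view_count(page_id):
        """Display-path read: on a cache miss, repopulate the totals hash from MySQL."""
        total = r.hget("interval_counter:total", page_id)
        if total is None:
            seeded = fetch_count_from_mysql(page_id)   # hypothetical helper
            total = r.hincrby("interval_counter:total", page_id, seeded)  # populate as an increment
        r.expire("interval_counter:total", 7 * 24 * 3600)  # keep the actively used key fresh
        return int(total)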

You can tune the cron job's performance by adjusting the interval to match the data-integrity requirements you determined from the earlier questions. The quicker you need the cron job to be, the smaller the window you use; and with a smaller window you have a smaller set of pages to update per run. The big advantage here is that you do not need to figure out which keys need updating and then go fetch them; you can do an HGETALL and iterate over the hash's fields to do the updates. This also saves many round trips by retrieving all the data at once. Either way, you will likely want to consider a second Redis instance, slaved to the first, to do your reads from. You would still perform deletes against the master, but those operations are much quicker and less likely to introduce delays into your write-heavy instance.

If you need disk persistence of the Redis DB, then certainly put that on a slave instance. Otherwise, if you have a lot of data changing frequently, your RDB dumps will be running constantly.

Hope this helps. There are no "canned" answers, because to use Redis properly you first need to think about how you will access the data, and that differs greatly from user to user and from project to project. Here I based the route taken on this description: two consumers accessing the data, one to display it and the other to determine updates for another data source.

+6

Consolidating my other answer:

Define the time interval at which the transfer from Redis to MySQL should happen, i.e. minute, hour or day. Define it in such a way that you can quickly and easily derive an identifying key from it. This key must be ordered, i.e. an earlier time should give a smaller key.

Let it be hourly, and the key will be YYYYMMDD_HH for readability.

Define a prefix like "hitcount_".

Then for each time interval you keep a hash hitcount_<timekey> in Redis, which contains all the items requested during that interval in the form ITEM => count.

There are two parts to the solution (both are sketched in code after this list):

  • On the actual page being counted:

    a) get the current $timekey, e.g. via the date function

    b) get the value of $ITEM

    c) send the Redis command HINCRBY hitcount_$timekey $ITEM 1

  • A cronjob that runs at that same interval, but not too close to the interval boundary (for example: not exactly on the full hour). This cronjob:

    a) retrieves the current time key (right now that would be 20130527_08)

    b) requests all the relevant keys from Redis with KEYS hitcount_* (this should be a small number)

    c) compares each such hash with the current hitcount_<timekey>

    d) if a key is less than the current one, treats it as $processing_key:

    • read all ITEM => count pairs with HGETALL $processing_key as $item, $cnt
    • update the database with UPDATE ITEM SET VIEW_COUNT = VIEW_COUNT + $cnt WHERE ITEM = $item
    • remove that item from the hash with HDEL $processing_key $item
    • no need to delete the hash itself - there are no empty hashes in Redis, as far as I have tried
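
A condensed sketch of both parts in Python, assuming redis-py and mysql.connector; the key prefix, table and column names follow the answer, everything else (connection details, function names) is illustrative:

    import datetime
    import mysql.connector
    import redis

    r = redis.Redis(decode_responses=True)
    PREFIX = "hitcount_"

    def count_hit(item):
        """Part 1: called from the page being counted."""
        timekey = datetime.datetime.utcnow().strftime("%Y%m%d_%H")
        r.hincrby(PREFIX + timekey, item, 1)

    def flush_old_intervals():
        """Part 2: the cronjob, run away from the interval boundary."""
        current = PREFIX + datetime.datetime.utcnow().strftime("%Y%m%d_%H")
        db = mysql.connector.connect(user="app", password="...", database="site")
        cur = db.cursor()
        for key in r.keys(PREFIX + "*"):   # should only be a handful of keys
            if key >= current:
                continue                   # skip the interval still being written to
            for item, cnt in r.hgetall(key).items():
                cur.execute(
                    "UPDATE ITEM SET VIEW_COUNT = VIEW_COUNT + %s WHERE ITEM = %s",
                    (int(cnt), item),
                )
                r.hdel(key, item)          # the hash vanishes by itself once empty
            db.commit()
        db.close()

Scheduling the cronjob at, say, 20 minutes past the hour keeps it away from the interval boundary, as suggested above.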

If you want to involve a TTL, say because the cleanup cronjob might be unreliable (it might not run for many hours), then you could create the hashes for future intervals in advance with an appropriate TTL; that is, right now you could create hash 20130527_09 with a TTL of 10 hours, 20130527_10 with a TTL of 11 hours, 20130527_11 with a TTL of 12 hours. The problem is that you would need a pseudo-key for that, because empty hashes seem to be deleted automatically.

+2

See EDIT3 for the current state of the answer.

I would keep a key for each ITEM. A few tens or hundreds of thousands of keys are definitely not a problem.

How much does the set of pages change? I mean, do you get a lot of pages that will never be called again? Otherwise I would simply:

  • INCR a value for ITEM on the page request.
  • Every minute or every 5 minutes call a cronjob that reads the Redis keys, reads a value (say 7) and reduces it with DECRBY ITEM 7. In MySQL you then increase the value for this item by 7 (see the sketch after this list).
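
A sketch of that cronjob step for a single key in Python with redis-py; update_mysql is a hypothetical helper that issues the UPDATE statement:

    import redis

    r = redis.Redis(decode_responses=True)

    def flush_item(item, update_mysql):
        """Push the current count for one ITEM to MySQL, then subtract what was pushed."""
        value = int(r.get(item) or 0)   # say the key currently holds 7
        if value:
            update_mysql(item, value)   # e.g. UPDATE ITEM SET VIEW_COUNT = VIEW_COUNT + 7 ...
            r.decrby(item, value)       # hits arriving in the meantime are preserved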

If you have many pages/ITEMs that will never be called again, you can run a clean-up job once a day to delete keys with a value of 0. That job should be locked against the website re-adding such a key at the same time.

I would not set a TTL at all, so the values live forever. You can check the memory usage, but a great many different pages should still fit within current gigabytes of memory.

EDIT: INCR is very good for this because it creates the key if it was not set before.

EDIT2: Given the large number of different pages, you can use HASHES with HINCRBY ( http://redis.io/commands/hincrby ) instead of the slow "KEYS *" command. However, I am not sure whether HGETALL is much faster than KEYS *, and a HASH does not allow a TTL on individual fields.

EDIT3: Oh well, sometimes the good ideas come late. It is so simple: just prefix the key with a time slot (for example, day_hour), or create a HASH using a "request_" prefix. Then there is no overlap between deletes and increments! Every hour you take the keys with old day_hour_* values, update MySQL and delete those old keys. The only requirement is that your servers do not differ too much in their clocks, so use UTC and synchronized servers, and do not run the cron at x:01 but at x:20 or so.

This means: the page being called maps a hit on ITEM1 at 23:37 on May 26, 2013 to the hash 20130526_23, field ITEM1: HINCRBY count_20130526_23 ITEM1 1

An hour later, the list of count_* keys is checked, and everything up to count_20130526_23 is processed (read the values with HGETALL, update MySQL) and deleted field by field after processing (HDEL). When that is done, you check whether HLEN is 0 and DEL the count_... hash itself.
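
A sketch of processing one old hash along those lines, assuming redis-py; update_mysql is again a hypothetical helper for the UPDATE statement:

    import redis

    r = redis.Redis(decode_responses=True)

    def process_old_hour(key, update_mysql):
        """Drain one old count_<YYYYMMDD>_<HH> hash into MySQL, then clean it up."""
        for item, cnt in r.hgetall(key).items():
            update_mysql(item, int(cnt))   # hypothetical helper
            r.hdel(key, item)              # delete each field once it has been processed
        if r.hlen(key) == 0:               # anything left would mean a concurrent writer
            r.delete(key)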

Thus you only have a small number of keys (one per unprocessed hour), which keeps KEYS count_* fast, and you then process the actions of that hour. You can give the hashes a TTL of a few hours if your cron is delayed, hangs, or is down for a while, or something like that.

+1

Source: https://habr.com/ru/post/945880/

