Cache Replacement Algorithm

I have a software project that creates fingerprint (hash) values from objects of varying sizes. The larger the object, the more expensive the hash is to compute. The hashes are used for comparison purposes.

Now I want to cache the hash values to improve the performance of subsequent comparisons. For any given cache entry, I have the following metrics:

  • number of hits
  • date and time of the last modification
  • hashed object size

So, to my question: given the need to limit the size of the cache (to a specific number of entries), what is a balanced approach to replacing cache entries?

Obviously, larger objects are more expensive to hash, so they should be retained as long as possible. However, I want to avoid a situation where filling the cache with many large objects prevents future (smaller) items from being cached.

So, based on the metrics available to me (see above), I am looking for a good general-purpose "formula" for expiring (evicting) cache entries when the cache is full.

All thoughts and comments are appreciated.

+4
5 answers

You need to think about the nature of the objects: consider how likely each object is to be requested again, and evict the least likely one.

This is very specific to your software and to the nature of the objects. If they are used repeatedly throughout the program, they will likely exhibit locality of reference, in which case you should use an LRU (least recently used) algorithm.

If objects with a higher hit count are more likely to be requested again, then use the hit count (and evict the entry with the lowest).

Take a look at Cache Algorithms

Basically, you want to evict the entry that minimizes:

p * cost

where p = the probability the entry will be requested again, and cost = the cost of re-computing (re-hashing) that object.
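A minimal sketch of this idea in Python. The recall-probability estimate (`hits / age`) and the assumption that re-hash cost is proportional to object size are illustrative choices, not part of the answer:

```python
# Hypothetical sketch: evict the entry with the lowest p * cost.
# p is a crude recall-probability estimate; cost approximates the
# expense of re-hashing the object. All names here are illustrative.

def evict_candidate(entries):
    """entries: dict mapping key -> (hits, age_seconds, size_bytes)."""
    def score(item):
        key, (hits, age, size) = item
        p = hits / (age + 1.0)   # recall probability: hit rate over lifetime
        cost = size              # re-hash cost assumed proportional to size
        return p * cost
    # Evict the key whose (probability * cost) product is smallest.
    return min(entries.items(), key=score)[0]

cache = {
    "a": (100, 10.0, 1_000),    # hot and small -> high score, keep
    "b": (1, 500.0, 50_000),    # cold but large -> moderate score
    "c": (2, 900.0, 100),       # cold and small -> lowest score, evict
}
print(evict_candidate(cache))   # -> c
```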

+1

Assuming you can record the time of each entry's last access, I would assign a "Cost" to each entry and evict the least expensive entry at any given time.

Cost = Size * N - TimeSinceLastUse * M 

Assuming you delete entries from the cache entirely (and do not retain old hit-count data), I would avoid using hit count as a metric: an entry can accumulate a high hit count simply because it has been in the cache a long time, and that high count then keeps it there even longer.
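A sketch of this cost formula, assuming the weights N and M are tuned for your workload (the values below are arbitrary placeholders):

```python
import time

# Illustrative weights: N favors keeping large (expensive-to-hash) objects,
# M penalizes staleness. Both are assumptions to be tuned empirically.
N = 1.0    # weight on object size
M = 50.0   # weight on time since last use


def cost(size_bytes, last_use_ts, now=None):
    """Cost = Size * N - TimeSinceLastUse * M, as in the answer above."""
    now = time.time() if now is None else now
    return size_bytes * N - (now - last_use_ts) * M


def pick_eviction(entries, now):
    """entries: key -> (size_bytes, last_use_ts). Evict the lowest-cost key."""
    return min(entries, key=lambda k: cost(*entries[k], now=now))


now = 1_000.0
entries = {
    "big-recent":  (100_000, now - 1.0),    # cost = 99_950  -> keep
    "small-stale": (500,     now - 600.0),  # cost = -29_500 -> evict
}
print(pick_eviction(entries, now))          # -> small-stale
```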

+1

I usually use a least recently used (LRU) scheme to drop things from the cache, unless some items are very expensive to reconstruct. LRU has the advantage of being easy to implement, and it works well for a wide range of applications. It also keeps the most recently used items in the cache.

In essence, I build a linked list that is also indexed by a dictionary. When a client wants an item, I look it up in the dictionary. If it is found, I unlink the node from the linked list and move it to the head of the list. If the item is not in the cache, I construct it (load it from disk, or whatever), put it at the head of the list, insert it into the dictionary, and delete the item at the tail of the list.

+1

You might want to try a layered (tiered) cache: dedicate a certain percentage of the cache to items that are expensive to create, and the rest to items that are cheap to create but accessed more often. You can then apply a different eviction strategy to the expensive tier than to the cheap one.
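One way to sketch this two-tier idea, assuming a simple size threshold decides which tier an item belongs to and each tier uses plain LRU internally (the threshold, capacities, and class names are all made up for illustration):

```python
from collections import OrderedDict


class TieredCache:
    """Sketch of a layered cache: one tier for expensive (large) items,
    one for cheap items, each with its own capacity and LRU eviction."""

    def __init__(self, expensive_slots, cheap_slots, expensive_threshold):
        self.threshold = expensive_threshold  # size that marks an item "expensive"
        self.expensive = OrderedDict()
        self.cheap = OrderedDict()
        self._caps = {"expensive": expensive_slots, "cheap": cheap_slots}

    def put(self, key, value, size):
        if size >= self.threshold:
            tier, cap = self.expensive, self._caps["expensive"]
        else:
            tier, cap = self.cheap, self._caps["cheap"]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > cap:
            tier.popitem(last=False)          # LRU eviction within the tier

    def get(self, key):
        for tier in (self.expensive, self.cheap):
            if key in tier:
                tier.move_to_end(key)
                return tier[key]
        return None


c = TieredCache(expensive_slots=1, cheap_slots=2, expensive_threshold=10_000)
c.put("big1", "h1", 50_000)
c.put("big2", "h2", 60_000)   # expensive tier is full: evicts big1
c.put("s1", "h3", 100)        # cheap tier, unaffected by the big items
print(c.get("big2"), c.get("big1"))   # -> h2 None
```

The point of the design is that a flood of large objects can only evict other large objects, never the small, frequently accessed ones.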

+1

The algorithm could also take into account the cost of recreating a missing item. That way, you would keep the most valuable items in the cache.

0

Source: https://habr.com/ru/post/1343374/
