Finding the most frequently used items without counting every item

I was wondering if there is an algorithm for finding the "most used items" without having to count the occurrences of every item? For example, suppose I run a search engine and want to track the 10 most popular queries. What I don't want to do is keep a counter for every query, because there could be far too many of them (and most will occur only once). Is there a simple algorithm for this? Maybe something probabilistic? Thanks!

+4
4 answers

Well, if you have a very large number of queries (as a search engine presumably does), then you can simply "sample" them. You might receive 1,000 queries per second, but if you record only one per second, then over a long enough period you will get an answer that is relatively close to the "real" one.

This is how a "sampling" profiler works, for example. Every n milliseconds it checks which function is currently executing. Over a long period (a few seconds) you get a good picture of the "expensive" functions, because they appear most often in your samples.

You still have to do the "counting", but by taking periodic samples instead of counting every individual query, you put an upper bound on the amount of data you actually need to store (e.g. at most one query per second, etc.).
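A minimal sketch of this sampling idea (all names and the `sample_rate` value are illustrative, not from the answer): count only a random fraction of the query stream, so the work and memory are bounded while the relative ranking of frequent queries is preserved.

```python
import random
from collections import Counter

class SamplingTopK:
    """Approximate top-k queries by counting only a random sample.

    Hypothetical sketch: sample_rate=0.001 means roughly 1 in 1000
    queries gets counted, bounding the data we have to store.
    """

    def __init__(self, sample_rate=0.001, seed=None):
        self.sample_rate = sample_rate
        self.counts = Counter()
        self.rng = random.Random(seed)

    def observe(self, query):
        # Flip a biased coin; count the query only when it wins.
        if self.rng.random() < self.sample_rate:
            self.counts[query] += 1

    def top(self, k=10):
        # Sampled counts approximate true frequencies up to a
        # constant factor of 1/sample_rate.
        return self.counts.most_common(k)
```

Frequent queries dominate the samples almost surely, while rare one-off queries are mostly never stored at all.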

+4

If you want the most commonly used search queries at any given moment, you don't need endless counters tracking every query ever sent. Instead, you need an algorithm that measures the number of occurrences of a query within a specific period of time. This is fairly simple. Any query sent to the search engine, for example the word "cache", is stored for a fixed period of time called the refresh window (its length depends on the kind of traffic your search engine receives and on how many "top results" you want to track). If the refresh window expires and the word "cache" has not been searched again, the entry is deleted. If it has been searched again, your algorithm only needs to track the rate at which searches for "cache" arrive. To do this, simply store all queries in a "sampler". Each entry carries a counter and an expiration time, after which the entry is deleted. The entries with the highest active counters are your top queries.
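The scheme above can be sketched as a counter table with per-query expiration (a sketch under assumptions: the `window` parameter stands in for the answer's "refresh window", and the clock is injectable purely so the behavior is testable):

```python
import time
from collections import defaultdict

class ExpiringCounter:
    """Count queries seen within a sliding refresh window.

    Entries not observed again within `window` seconds are dropped,
    so memory tracks only currently-active queries.
    """

    def __init__(self, window=3600.0, clock=time.monotonic):
        self.window = window
        self.clock = clock                 # injectable for testing
        self.counts = defaultdict(int)
        self.last_seen = {}

    def observe(self, query):
        now = self.clock()
        self._expire(now)
        self.counts[query] += 1
        self.last_seen[query] = now

    def _expire(self, now):
        # Drop any query not seen within the refresh window.
        for q, t in list(self.last_seen.items()):
            if now - t > self.window:
                del self.last_seen[q]
                del self.counts[q]

    def top(self, k=10):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]
```

A production version would expire lazily or from a priority queue rather than scanning every entry on each observation; the scan here just keeps the sketch short.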

+2

Counting every request would be expensive, but it's the only way to guarantee that the top 10 really is the top 10. You'll have to cheat.

One idea is to store a table of URLs, hit counters and timestamps, indexed by count and then by timestamp. When the table reaches some arbitrary maximum size, start evicting low-count entries that are older than a given number of days. Although old, infrequent queries won't be counted, the queries likely to make the top 10 should stay in the table because of their higher arrival rate.

Another idea is to hash search queries with a 16-bit (or wider) hash function. You keep a table of 65,536 entries storing counters and query strings. When a search arrives, increment the corresponding entry and set the query string if necessary. However, this approach has a major drawback: because unrelated queries can hash to the same bucket, a spam bot making repeated requests such as "cheap viagra" can end up sharing a counter with legitimate queries, and the combined count may push the spam query onto the front page.
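A sketch of this fixed-table idea (the class name, the choice of `crc32` as the hash, and the "last writer wins" policy are all my assumptions, not from the answer); the collision weakness it describes is visible directly in the code:

```python
import zlib

class HashBucketCounter:
    """Fixed-size count table keyed by a 16-bit hash of the query.

    65,536 buckets, each holding a count and the most recent query
    that hashed into it. Colliding queries share a counter, which
    is exactly the spam weakness the answer points out.
    """

    SIZE = 1 << 16  # 65536 buckets

    def __init__(self):
        self.counts = [0] * self.SIZE
        self.queries = [None] * self.SIZE

    def _bucket(self, query):
        # crc32 reduced to 16 bits; any stable hash would do.
        return zlib.crc32(query.encode("utf-8")) & (self.SIZE - 1)

    def observe(self, query):
        b = self._bucket(query)
        self.counts[b] += 1
        self.queries[b] = query    # last writer wins on collision

    def top(self, k=10):
        ranked = sorted(range(self.SIZE), key=lambda b: -self.counts[b])
        return [(self.queries[b], self.counts[b])
                for b in ranked[:k] if self.counts[b] > 0]
```

Memory is strictly bounded by the table size, but two queries in the same bucket are indistinguishable, so a high count may belong to the wrong (or a malicious) query.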

0

What you need is a cache, of which there are many varieties; see the Wikipedia articles on Cache Algorithms and the Aging page replacement algorithm.

0

Source: https://habr.com/ru/post/1308855/
