Let's say my system listens to user click events and stores them in archive storage. I know where each event comes from (userId, roughly a hundred distinct users) and which URL was clicked (url, effectively unbounded variations).
class ClickEvent { String userId; String url; }
If my system potentially receives thousands of events per second, I don't want to put massive load on the storage by calling it once for every single click. Suppose the storage is something like AWS S3, or any data store that is better suited to receiving fewer, larger files than to handling tens of thousands of requests per second.
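For reference, a batched write in that style might look like the minimal sketch below, assuming the AWS SDK for Java v2 and a hypothetical bucket name and key layout; it writes one object per accumulated batch instead of one request per click:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.time.Instant;
import java.util.List;

public class ClickArchiver {

    private final S3Client s3 = S3Client.create();
    private final String bucket = "click-archive"; // hypothetical bucket name

    // Write one object containing a whole batch of clicks for a user,
    // rather than one PutObject request per click.
    public void writeBatch(String userId, List<String> urls) {
        String key = "clicks/" + userId + "/" + Instant.now().toEpochMilli() + ".txt";
        String body = String.join("\n", urls); // one URL per line; could also be JSON/CSV/etc.
        s3.putObject(
                PutObjectRequest.builder().bucket(bucket).key(key).build(),
                RequestBody.fromString(body));
    }
}
```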
My current approach is to use the Google Guava Cache library (or really any cache with expiration support).
Suppose the cache key is userId and the cache value is a List<url>.
- Cache miss -> add a new entry to the cache: (userId, [url1])
- Cache hit -> append the new URL to the existing list: (userId, [url1, url2, ...])
- An entry expires a configurable X minutes after it was first written, or once it has accumulated 10,000 URLs.
- When an entry expires, I write its data to the storage, ideally collapsing up to 10,000 small individual writes into 1 large write (see the sketch after this list).
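Here is a rough sketch of that buffering idea with Guava's Cache; the class and method names, the flush threshold, and the flushToStorage placeholder are my own assumptions, not an established pattern. One caveat: Guava evicts entries lazily as a side effect of other cache operations, so a periodic cleanUp() call is needed for the time-based flush to actually fire on schedule.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalCause;
import com.google.common.cache.RemovalListener;

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ClickBuffer {

    private static final int MAX_URLS_PER_USER = 10_000; // hypothetical flush threshold

    private final Cache<String, List<String>> buffer;

    public ClickBuffer(long flushAfterMinutes) {
        // Flush a user's accumulated URLs whenever their entry leaves the cache,
        // either through time-based expiry or the explicit invalidation below.
        RemovalListener<String, List<String>> onRemoval = notification -> {
            if (notification.getCause() != RemovalCause.REPLACED) {
                flushToStorage(notification.getKey(), notification.getValue());
            }
        };

        this.buffer = CacheBuilder.newBuilder()
                .expireAfterWrite(flushAfterMinutes, TimeUnit.MINUTES)
                .removalListener(onRemoval)
                .build();

        // Guava performs eviction lazily during other cache calls, so trigger
        // maintenance periodically to make expiry-based flushes happen on time.
        ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
        cleaner.scheduleAtFixedRate(buffer::cleanUp, 1, 1, TimeUnit.MINUTES);
    }

    public void onClick(ClickEvent event) throws ExecutionException {
        // Cache miss -> create an empty list; cache hit -> reuse the existing one.
        List<String> urls = buffer.get(event.userId, CopyOnWriteArrayList::new);
        urls.add(event.url);

        // Size threshold reached -> evict now, which triggers the removal listener.
        if (urls.size() >= MAX_URLS_PER_USER) {
            buffer.invalidate(event.userId);
        }
    }

    private void flushToStorage(String userId, List<String> urls) {
        // Placeholder: replace with one batched write, e.g. a single S3 PutObject
        // containing all of this user's buffered URLs.
        System.out.printf("Flushing %d clicks for user %s%n", urls.size(), userId);
    }
}
```

Note that under heavy concurrency a click could still be added to a list after it has been handed to the removal listener, so a production version would need a more careful handoff than this sketch shows.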
I'm not sure whether there is a "standard" or better way (or even a well-known library) to solve this problem, i.e. accumulating thousands of events per second and persisting them to the storage / file / data store in batches, instead of passing that high load straight through to downstream services. I feel this is one of the common use cases for a big-data system.