I would like some feedback and suggestions on two approaches I am considering for implementing searchable indexes using Redis sorted sets.
Situation and purpose
We currently have some key-value tables stored in Cassandra for which we would like to have indexes. For example, one table contains records of people; the Cassandra table has id as its primary key and the serialized object as its value. Each object has fields such as first_name, last_name, last_updated, and others.
We want to be able to run queries such as "last_name = 'Smith' AND first_name > 'Joel'", "last_name < 'Aaronson'", "last_name = 'Smith' AND first_name = 'Winston'", and so on. The search results should be the matching ids, so that we can then retrieve the objects from Cassandra. I think the searches above could be served by a single index, sorted lexicographically by last_name, first_name, and last_updated. If we need searches that use a different order (for example, "first_name = 'Zeus'"), we can add a similar index that supports them (for example, first_name, last_updated).
We are considering Redis for this because we need to be able to process a large number of records per minute. I read up on some common ways of using Redis sorted sets and came up with two possible implementations:
Option 1: one sorted set per index
For our index on last_name, first_name, last_updated, we would have a sorted set in Redis under the key indexes:people:last_name:first_name:last_updated, which would contain strings in the format last_name:first_name:last_updated:id. For instance:
smith:joel:1372761839.444:0azbjZRHTQ6U8enBw6BJBw
(For the separator, I could use "::" rather than ":" or something else to work better with the lexicographic order, but ignore it for now)
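To make the member format and the separator remark concrete, here is a minimal Python sketch. The helper name make_member, the lowercasing of names, and the fixed-width timestamp are my own assumptions for illustration, not part of the design above. It also shows one way the separator can interact with lexicographic order: a hyphenated last name sorts before "smith:" when ":" is the separator, but after it when the separator is a byte lower than any character allowed in a name (here "\x00").

```python
def make_member(last_name, first_name, last_updated, record_id, sep=":"):
    """Build one index member: last_name, first_name, last_updated, id.

    Assumptions for this sketch: names are lowercased, and the timestamp is
    zero-padded to a fixed width so that string order matches numeric order.
    """
    stamp = f"{last_updated:014.3f}"
    return sep.join([last_name.lower(), first_name.lower(), stamp, record_id])


example = make_member("Smith", "Joel", 1372761839.444, "0azbjZRHTQ6U8enBw6BJBw")

# With ":" as separator, "-" (0x2D) sorts before ":" (0x3A), so the longer
# last name "smith-jones" lands in front of "smith".
with_colon = sorted([
    make_member("smith", "joel", 1.0, "a"),
    make_member("smith-jones", "al", 2.0, "b"),
])

# With a separator lower than any name character, "smith" comes first again.
with_nul = sorted([
    make_member("smith", "joel", 1.0, "a", sep="\x00"),
    make_member("smith-jones", "al", 2.0, "b", sep="\x00"),
])
```

The sketch only uses ASCII names, where Python's string comparison agrees with a byte-wise comparison; non-ASCII names would need to be compared as encoded bytes.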
All elements would be given a score of 0, so the sorted set would simply be ordered lexicographically by the strings themselves. If I then want to run a query like "last_name = 'smith' AND first_name < 'bob'", I would need to get all the items in the set that start with 'smith:' and come before 'smith:bob'.
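As a rough illustration of how such queries turn into string ranges, the sketch below models the sorted set as a sorted Python list (not as a real Redis structure) and selects half-open ranges [low, high) with bisect. The sample records, the ids id0 to id6, and the helper select_ids are made up for the example. The trick of using ";" relies on it being the character right after ":" in ASCII, so "smith;" is an upper bound for everything that starts with "smith:".

```python
from bisect import bisect_left

SEP = ":"
AFTER_SEP = chr(ord(SEP) + 1)  # ";", the character that sorts right after ":"

# Stand-in for the sorted set: with equal scores, members sort as plain strings.
index = sorted([
    "aaron:zed:1372000000.000:id0",
    "aaronson:amy:1372000001.000:id1",
    "smith:alice:1372000002.000:id2",
    "smith:bob:1372000003.000:id3",
    "smith:joel:1372761839.444:0azbjZRHTQ6U8enBw6BJBw",
    "smith:joelle:1372000004.000:id4",
    "smith:winston:1372000005.000:id5",
    "smithson:al:1372000006.000:id6",
])


def select_ids(low, high):
    """Return the ids of all members m with low <= m < high."""
    hits = index[bisect_left(index, low):bisect_left(index, high)]
    return [member.rsplit(SEP, 1)[1] for member in hits]


# last_name = 'smith' AND first_name < 'bob'
smith_before_bob = select_ids("smith" + SEP, "smith" + SEP + "bob")

# last_name = 'smith' AND first_name > 'joel'
smith_after_joel = select_ids("smith:joel" + AFTER_SEP, "smith" + AFTER_SEP)

# last_name < 'aaronson'
before_aaronson = select_ids("", "aaronson" + SEP)

# last_name = 'smith' AND first_name = 'winston'
smith_winston = select_ids("smith:winston" + SEP, "smith:winston" + AFTER_SEP)
```

Each query shape maps to exactly one contiguous slice of the index, which is what makes a single sorted set sufficient for this family of searches.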
As far as I can tell, this approach has the following disadvantages:
- Redis has no function to select a range based on a string value. Such a function, called ZRANGEBYLEX, was proposed by Salvatore Sanfilippo at https://github.com/antirez/redis/issues/324, but it has not been implemented, so I would have to find the endpoints with binary searches and fetch the range myself (possibly in Lua, or at the application level in Python, which is the language we use to access Redis).
- If we want to add a time-to-live for index entries, it seems the easiest way would be a scheduled task that goes through the whole index and removes expired items.
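The two workarounds in the list above could look roughly like this. It is a sketch under assumptions: lower_bound only needs a way to read the member at a given rank and the set size (against Redis those reads would correspond to something like ZRANGE key i i and ZCARD), and here a plain list plays that role so the code runs on its own. expired_members is the full scan a scheduled task would do; the function names and sample data are hypothetical.

```python
def lower_bound(member_at, size, target):
    """Rank of the first member >= target, using one read per probe."""
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        if member_at(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo


def expired_members(members, cutoff, sep=":"):
    """Full scan: members whose last_updated field is older than cutoff.

    Assumes the last_name:first_name:last_updated:id layout and that names
    never contain the separator.
    """
    return [m for m in members if float(m.split(sep)[2]) < cutoff]


index = sorted([
    "jones:amy:1372000001.000:id1",
    "smith:alice:1372000002.000:id2",
    "smith:bob:1372000003.000:id3",
    "smith:joel:1372761839.444:0azbjZRHTQ6U8enBw6BJBw",
    "young:zed:1372000000.000:id0",
])

probes = []


def member_at(rank):
    probes.append(rank)  # each probe would be one round trip to Redis
    return index[rank]


# Endpoints for last_name = 'smith' AND first_name < 'bob'
start = lower_bound(member_at, len(index), "smith:")
stop = lower_bound(member_at, len(index), "smith:bob")
matches = index[start:stop]

stale = expired_members(index, cutoff=1372000002.5)
```

Each binary search costs about log2(n) reads, and the index can change between probes when this runs at the application level; that is one reason running it inside a Lua script, which Redis executes atomically, may be the more attractive of the two options.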
Option 2: small sorted sets sorted by last_updated
This approach would be similar, except that we would have many smaller sorted sets, each using a time-like value such as last_updated for the scores. For example, for the same last_name, first_name, last_updated index, we would have one sorted set per last_name, first_name combination. The key could be indexes:people:last_name=smith:first_name=joel, and the set would have an entry for each person named Joel Smith. Each entry would have the id as its member and the last_updated value as its score. For instance:
value: 0azbjZRHTQ6U8enBw6BJBw; score: 1372761839.444
The main advantages of this are that (a) searches where we know all the fields except last_updated would be very easy, and (b) implementing a time-to-live would be very simple using ZREMRANGEBYSCORE.
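Here is a small sketch of how advantage (b) might look from Python. To keep it runnable without a server, it uses a hypothetical in-memory stand-in, FakeSortedSets, whose method names mirror the redis-py calls zadd, zrangebyscore, and zremrangebyscore; I have not run this against a live Redis, so treat the client wiring as an assumption. The helpers index_key, add_person, and expire_up_to are names invented for the example.

```python
class FakeSortedSets:
    """In-memory stand-in exposing three sorted-set calls (hypothetical)."""

    def __init__(self):
        self.sets = {}

    def zadd(self, key, mapping):
        self.sets.setdefault(key, {}).update(mapping)

    def zrangebyscore(self, key, low, high):
        items = self.sets.get(key, {}).items()
        ordered = sorted(items, key=lambda kv: (kv[1], kv[0]))
        return [member for member, score in ordered if low <= score <= high]

    def zremrangebyscore(self, key, low, high):
        zset = self.sets.get(key, {})
        doomed = [m for m, score in zset.items() if low <= score <= high]
        for member in doomed:
            del zset[member]
        return len(doomed)


def index_key(last_name, first_name):
    return f"indexes:people:last_name={last_name}:first_name={first_name}"


def add_person(client, last_name, first_name, last_updated, record_id):
    # The id is the member, last_updated is the score.
    client.zadd(index_key(last_name, first_name), {record_id: last_updated})


def expire_up_to(client, key, cutoff):
    # Time-to-live in one call: drop every entry with score <= cutoff.
    return client.zremrangebyscore(key, float("-inf"), cutoff)


client = FakeSortedSets()
add_person(client, "smith", "joel", 1372761839.444, "0azbjZRHTQ6U8enBw6BJBw")
add_person(client, "smith", "joel", 1372000000.0, "older-id")

key = index_key("smith", "joel")
removed = expire_up_to(client, key, cutoff=1372500000.0)
remaining = client.zrangebyscore(key, float("-inf"), float("inf"))
```

The expiry is a single range delete per key, but it has to be issued for every last_name, first_name key, which ties into the key-tracking problem described next.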
The disadvantage, which seems very large to me:
- This method seems much more complicated to manage and search. For example, we would need an index to keep track of all the keys (in case, for example, we want to clean up at some point), and it would have to be organized hierarchically. A search such as "last_name < 'smith'" would first require going through the list of all last names to find those that come before smith, then for each of them going through all the first names it contains, and then for each of those fetching all the items from its sorted set. In other words, a lot of components to build and worry about.
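To show what that three-level walk involves, here is a sketch that models the hierarchy with plain Python structures: a sorted list of last names, a mapping from last name to first names, and a mapping from key to ids. All three structures, the sample data, and the function name search_last_name_lt are hypothetical; in Redis each of them would be its own key or set that has to be kept in sync.

```python
from bisect import bisect_left


def search_last_name_lt(last_names, first_names_by_last, ids_by_key, bound):
    """Collect ids for the query last_name < bound by walking three levels."""
    results = []
    # Level 1: every last name that sorts before the bound.
    for last in last_names[:bisect_left(last_names, bound)]:
        # Level 2: every first name recorded under that last name.
        for first in first_names_by_last[last]:
            # Level 3: every id stored in that pair's sorted set.
            key = f"indexes:people:last_name={last}:first_name={first}"
            results.extend(ids_by_key[key])
    return results


last_names = ["aaronson", "jones", "smith"]
first_names_by_last = {
    "aaronson": ["amy"],
    "jones": ["al", "zed"],
    "smith": ["joel"],
}
ids_by_key = {
    "indexes:people:last_name=aaronson:first_name=amy": ["id1"],
    "indexes:people:last_name=jones:first_name=al": ["id2", "id3"],
    "indexes:people:last_name=jones:first_name=zed": ["id4"],
    "indexes:people:last_name=smith:first_name=joel": ["0azbjZRHTQ6U8enBw6BJBw"],
}

found = search_last_name_lt(last_names, first_names_by_last, ids_by_key, "smith")
```

Even in this toy form, one query touches one structure per level plus one sorted set per matching name pair, so the number of lookups grows with the number of distinct names rather than staying at a single range read.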
Conclusion
So it seems to me that the first option would be better, despite its shortcomings. I would really appreciate any feedback on these two or on other possible solutions (even if they involve using something other than Redis).