We tackled this problem by sharding the data.
Basically, without going into the code details:
- Write your own indexing code
- Use hashing of the identifier (e.g. consistent hashing) to decide which Solr server it is sent to; a sketch follows this list
- Index each user's data on the corresponding shard (a shard may span several machines)
- Make sure you have redundancy
- Query the Solr shards:
  - Solr supports distributed queries via the shards parameter (see the query sketch after this list), or
  - run EmbeddedSolr and use it to execute the distributed request
- Solr will query all the shards and merge the results, and it also provides timeouts if you need to limit how long each shard spends on a request.
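To make the hashing step concrete, here is a minimal consistent-hash ring in Java. This is a sketch under assumptions, not the code described above: the shard names, the 128 virtual nodes per shard, and the use of MD5 are illustrative choices; any stable hash function works.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring mapping virtual nodes to shard names.
public final class ShardRing {
    private static final int VNODES = 128; // virtual nodes per shard (assumed)
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public ShardRing(Iterable<String> shards) {
        for (String shard : shards)
            for (int i = 0; i < VNODES; i++)
                ring.put(hash(shard + "#" + i), shard);
    }

    /** Returns the shard responsible for the given user identifier. */
    public String shardFor(String userId) {
        long h = hash(userId);
        // First ring position at or after the key's hash, wrapping around.
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            // Fold the first 8 digest bytes into a long ring position.
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With virtual nodes, adding or removing a shard only remaps a small fraction of identifiers, which is why consistent hashing is the usual choice for this kind of routing.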
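On the query side, a distributed request can be expressed through the standard shards request parameter. Below is a minimal SolrJ sketch (assuming SolrJ 6+); the host names, the users collection, the query string, and the 500 ms budget are made-up examples, while shards and timeAllowed are real Solr parameters.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQuery {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://solr1:8983/solr/users").build()) {
            SolrQuery q = new SolrQuery("name:alice");
            // Fan the query out to every shard; Solr merges the results.
            q.set("shards", "solr1:8983/solr/users,solr2:8983/solr/users");
            // Cap how long each shard may spend searching (milliseconds).
            q.set("timeAllowed", 500);
            QueryResponse rsp = solr.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}
```

When timeAllowed expires, Solr can return partial results, which matches the per-shard timeout behavior mentioned above.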
Even with everything I said above, I don't think Solr is the right fit for this. Solr is not well suited to searching over indexes that change constantly, and if you are mostly looking things up by identifier, you don't need a search engine at all.
For our project, we implemented all of the index building, load balancing, and query machinery ourselves and use Solr mostly as storage. But we started using Solr back when its sharding support was flaky and incomplete; I'm not sure what state it is in today.
Finally, if I were building this system today from scratch, without any of the work we have done over the past 4 years, I would recommend using a cache (say memcached or Redis) to hold all the users who are currently online, and at query time simply iterating over all of them and filtering by the criteria. The per-criterion filtering can be cached independently and updated incrementally, and iterating over 5000+ records does not necessarily take long if the matching logic is very simple.
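A minimal sketch of that approach with Redis through the Jedis client. The online_users set, the host/port, and the matchesCriteria predicate are hypothetical placeholders for real presence tracking and matching logic.

```java
import java.util.Set;
import redis.clients.jedis.Jedis;

public class OnlineUserFilter {
    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            // Presence tracking (done on login/logout elsewhere in the app).
            redis.sadd("online_users", "user42");

            // Query time: pull every online user and filter in memory.
            Set<String> online = redis.smembers("online_users");
            for (String userId : online) {
                if (matchesCriteria(userId)) {
                    System.out.println("match: " + userId);
                }
            }
        }
    }

    // Hypothetical matching predicate; the real criteria live in the app.
    static boolean matchesCriteria(String userId) {
        return userId.startsWith("user");
    }
}
```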