So here is my dilemma ...
I run a real-time search index with Solr, indexing about 6M documents per day. Documents expire after 7 days, so every day I add 6M documents and delete 6M documents. Unfortunately, I need to run “optimize” every so often, or else I will run out of disk space.
During an “optimize,” Solr continues to serve read requests, but write requests are blocked. All my writes go through a queue, so operationally everything is fine. However, because my index is so large, the “optimize” takes about an hour, and during that hour no new updates are available for reading. So my index is real-time except for the hour a day that I optimize; during that time, the index appears to be up to an hour behind. This is not optimal.
My current solution is this: write all data to two Solr indexes, both behind queues. Alternate the “optimize” between the two indexes every 12 hours. During the “optimize” of index 1, direct all read traffic to index 2, and vice versa. This time-based routing seems rather fragile and hacky, though.
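To make the routing concrete, here is a rough sketch of the idea rather than my actual code. The index names, the UTC start hours, and the one-hour window are placeholders I made up for illustration:

```python
from datetime import datetime, timezone

# Hypothetical schedule: UTC hour at which each index starts its "optimize".
OPTIMIZE_START_HOUR = {"index1": 0, "index2": 12}
# Assumed window length; in practice the optimize takes "about an hour".
OPTIMIZE_DURATION_HOURS = 1


def is_optimizing(index: str, now: datetime) -> bool:
    """Guess from the clock alone whether `index` is inside its optimize window."""
    start = OPTIMIZE_START_HOUR[index]
    return start <= now.hour < start + OPTIMIZE_DURATION_HOURS


def pick_read_index(now: datetime) -> str:
    """Return the first index that is not scheduled to be optimizing at `now`."""
    for index in ("index1", "index2"):
        if not is_optimizing(index, now):
            return index
    raise RuntimeError("both indexes are scheduled to optimize at the same time")


if __name__ == "__main__":
    print(pick_read_index(datetime.now(timezone.utc)))
```

The fragile part is that the router trusts the clock, not the index: if an “optimize” starts late or runs past its window, reads go straight back to an index that is still behind.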
Is there a better way?