Solr: 100% Write Availability During Optimization

So here is my dilemma ...

I run a real-time search index with Solr, indexing about 6M documents per day. Documents expire in 7 days. Therefore, every day I add 6M documents and delete 6M documents. Unfortunately, I need to run “optimize” every so often, otherwise I will run out of disk space.

During “optimize,” Solr continues to serve read requests, but write requests are blocked. All my writes sit behind a queue, so operationally everything is fine. However, since my index is so large, “optimize” takes about an hour, and during this hour new updates are not available for reading. Thus, my index is real-time except for the one hour per day during which I optimize. During that hour, the index appears to be up to an hour behind. This is not optimal.

My current solution is this: write all the data to two Solr indexes, both behind queues. Alternate “optimize” between the two indexes every 12 hours. While index 1 is optimizing, direct all read traffic to index 2, and vice versa. This time-based routing seems rather fragile and sloppy, though.
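To make the routing concrete, here is a minimal sketch of what I mean. The index names, the start hours (00:00 and 12:00), and the one-hour window are assumptions for illustration only.

```python
from datetime import datetime

# Assumed schedule: index1 optimizes at 00:00, index2 at 12:00.
OPTIMIZE_START_HOUR = {"index1": 0, "index2": 12}
OPTIMIZE_DURATION_HOURS = 1  # "about an hour"; assumed to be exact here
DEFAULT_READ_INDEX = "index1"
FALLBACK_READ_INDEX = "index2"


def is_optimizing(index: str, now: datetime) -> bool:
    """True if `now` falls inside the assumed optimize window of `index`."""
    start = OPTIMIZE_START_HOUR[index]
    return start <= now.hour < start + OPTIMIZE_DURATION_HOURS


def pick_read_index(now: datetime) -> str:
    """Send reads to the default index unless it is inside its optimize window."""
    if is_optimizing(DEFAULT_READ_INDEX, now):
        return FALLBACK_READ_INDEX
    return DEFAULT_READ_INDEX
```

The fragility is visible here: the router only knows the clock, so if an optimize run starts late or takes longer than the assumed hour, reads go to an index that is still behind.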

Is there a better way?

+3
4 answers

According to the comments here and the FAQ here, optimizing should not be necessary. Not optimizing may increase the size of the index at first, but the index should not keep growing indefinitely. I suggest you turn off optimization for a few days and monitor the size of the index.
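One way to do the monitoring is to sample the on-disk size of the index directory once a day and check whether it keeps climbing. This is only a sketch; the function names are made up here, and the directory you pass in depends on your own Solr data directory.

```python
import os


def index_size_bytes(index_dir: str) -> int:
    """Sum the sizes of all files under index_dir."""
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # Segment files can disappear while a merge is running.
                pass
    return total


def keeps_growing(samples: list[int]) -> bool:
    """True if every sample is strictly larger than the one before it."""
    return len(samples) > 1 and all(b > a for a, b in zip(samples, samples[1:]))
```

If `keeps_growing` stays true over many daily samples, the index is not levelling off on its own and you would need to look at the merge settings or go back to optimizing.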

+4

Another time-based option is to maintain a separate index for each day and write to all indexes every day. In this case, you do not need to do any deletions; instead, you rotate the indexes out in first-in-first-out (FIFO) order.

Index 1 = Day 1 + Day 2 + Day 3 + Day 4 + Day 5 + Day 6 + (no longer used)
Index 2 = empty + Day 2 + Day 3 + Day 4 + Day 5 + Day 6 + Day 7 + (no longer used)
Index 3 = empty + empty + Day 3 + Day 4 + Day 5 + Day 6 + Day 7 + Day 8
... 
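A rough simulation of the rotation, as one reading of the scheme: a new index is opened each day, the day's documents are written to every live index, and the oldest index is dropped whole once more than seven are live. The class name and the seven-index limit are assumptions for illustration, and "days" stand in for the actual documents.

```python
from collections import deque

RETENTION_DAYS = 7  # assumed to match the 7-day document expiry


class DailyIndexRotation:
    def __init__(self, retention_days: int = RETENTION_DAYS):
        self.retention_days = retention_days
        # Oldest index first; each entry is (start_day, list of days it holds).
        self.live = deque()

    def advance(self, day: int) -> None:
        """Open today's index, retire the oldest if needed, then write today everywhere."""
        self.live.append((day, []))
        if len(self.live) > self.retention_days:
            self.live.popleft()  # dropped as a whole, no per-document deletes
        for _start, days in self.live:
            days.append(day)

    def read_index(self):
        """The oldest live index holds the widest window, so reads go there."""
        return self.live[0]


rotation = DailyIndexRotation()
for day in range(1, 9):
    rotation.advance(day)
start_day, days_held = rotation.read_index()
```

After day 8 in this sketch, the index started on day 1 is gone and reads are served by the index started on day 2. The cost is that every document is written once per live index.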

+1

Have you tried using a different mergeFactor or a different merge policy? If you are writing constantly, tuning how segments are merged may be a better approach than running optimize.

0

Use replication.

Write to your master and replicate to your slave. Run optimize on the master, and send all queries to the slave.
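On the client side this comes down to picking the base URL by operation. A minimal sketch, assuming hypothetical host names and a single-core layout with the standard `/update` and `/select` handlers (a multi-core setup would have the core name in the path):

```python
from urllib.parse import urlencode

# Hypothetical hosts; replace with your own master and slave.
MASTER_URL = "http://solr-master:8983/solr"
SLAVE_URL = "http://solr-slave:8983/solr"


def update_url() -> str:
    """All writes (adds, deletes, commits, optimize) go to the master."""
    return MASTER_URL + "/update"


def select_url(query: str) -> str:
    """All searches go to the slave."""
    return SLAVE_URL + "/select?" + urlencode({"q": query})
```

For example, `select_url("title:solr")` builds a query URL against the slave, while anything posted to `update_url()` only touches the master.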

0

Source: https://habr.com/ru/post/1794363/