I am working on an ES cluster holding 6B small documents organized into 6.5K indexes, 6TB in total. Indexes are replicated and distributed across 7 servers. Index sizes range from a few kilobytes to hundreds of GB.
Before using ES, I used Lucene with the same document organization.
Recovery of the Lucene application was practically immediate: indexes were lazily loaded when a request arrived, and the IndexReaders were then cached to speed up future responses.
Now with Elasticsearch, recovery is very slow (tens of minutes). Note that, before a typical failure, all indexes are open and most of them are actively receiving documents for indexing.
Is there a good recipe for reducing ES recovery time? I'm interested in everything related to index management, not just configuration. For example, I would like to restore the most important indexes first and then load all the rest; this would reduce the perceived downtime for most users.
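To illustrate what I mean, here is a sketch of the kind of approach I had in mind (assuming the `index.priority` recovery-prioritization setting applies to this scenario; `important_index` is a placeholder name):

```shell
# Give a critical index a higher recovery priority so it is
# allocated before indexes with the default priority (1).
curl -XPUT 'localhost:9200/important_index/_settings' -d '{
  "index.priority": 10
}'
```

Is something like this the right tool, or is there a better mechanism for staging recovery?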
I am using the following configuration:
# Max number of concurrent streams per node used to recover shards at startup
indices.recovery.concurrent_streams: 80
# Max number of bytes per second read during recovery
indices.recovery.max_bytes_per_sec: 250mb
# Number of initial primary-shard recoveries allowed concurrently per node
cluster.routing.allocation.node_initial_primaries_recoveries: 20
# Number of concurrent shard recoveries allowed per node
cluster.routing.allocation.node_concurrent_recoveries: 80
# Number of streams to open (per node) for small files (under 5mb) when recovering a shard from a peer
indices.recovery.concurrent_small_file_streams: 30
PS: I am using ES 2.4.1 now, but I will move to ES 5.2 in a few weeks.
PPS: the scenario is recovery after a power outage, i.e. a full-cluster restart.
Thanks!