Hello, all you bright people,
We currently run a small 5-node, 300 GB cluster in production with about 30 million documents. Everything works flawlessly, except when a node actually drops out (I mean a network or hardware failure).
Usually, when we lose a node, the cluster becomes more or less completely unresponsive for several minutes, for both indexing and queries. This, of course, is less than ideal, since we have load 24/7.
I would really appreciate help in understanding the best-practice settings for a reliable cluster.
Our first goal is that the cluster does not stop responding when a node fails. After reading everything I can find on the Internet, I can't work out whether ES is designed to be unresponsive for ping_retries * ping_timeout seconds, or whether the cluster should continue to serve requests during that time. Can someone help shed some light on this?
Secondly, in the event of an even worse failure, when the cluster goes red, is it possible to let the cluster still serve read (query) requests?
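For reference, these are the fault detection settings I am asking about. We have not set them ourselves, so the values below are only what I believe the defaults to be (an assumption on my part, please correct me if they are wrong):

```yaml
# Zen fault detection settings (assumed defaults, not taken from our config)
discovery.zen.fd.ping_interval: 1s   # how often each node is pinged
discovery.zen.fd.ping_timeout: 30s   # how long to wait for a ping reply
discovery.zen.fd.ping_retries: 3     # failed pings before the node is considered gone
```

If those values are right, ping_retries * ping_timeout comes to 3 * 30 s = 90 s, which is in the same range as the outage we see.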
I would be very grateful to anyone who can help me understand how this works, or what we need to change to make our ES installation more reliable.
Here is our configuration:
cluster.name: clustername
node.name: nodename
path.data: /data
node.master: true
node.data: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.multicast.ping.enabled: false
discovery.zen.ping.unicast.enabled: true
discovery.zen.ping.unicast.hosts: ["host1","host2","host3"]
bootstrap.mlockall: true
http.cors.enabled: true
index.number_of_shards: 10
action.disable_delete_all_indices: true
marvel.agent.exporter.es.hosts: ["marvel:9200"]