Cluster hanging on node failure

Hello to all you bright people,

We currently run a small cluster in production: 5 nodes, about 300 GB of data, and roughly 30 million documents. Everything works flawlessly, except when a node actually goes down (I mean a network or hardware failure).

Usually, when we lose a node, the cluster becomes more or less completely unresponsive for several minutes, for both indexing and queries. This is of course less than ideal, since we have load 24/7.

I would really appreciate help in understanding the best practice settings in order to have a reliable cluster.

Our first goal is that the cluster does not stop responding when a node fails. After reading everything I could find online, I still can't tell whether ES is designed to hang for ping_retries * ping_timeout seconds, or whether the cluster should keep serving requests even during that time. Can someone help shed some light on this?

Secondly, in the event of an even worse failure, when the cluster goes red, is it possible to let the cluster still serve read/query requests?

I would be very grateful to anyone who can help me understand how this works or what we need to change to make our ES installation more reliable.

Here is our configuration:

cluster.name: clustername
node.name: nodename
path.data: /data
node.master: true
node.data: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.multicast.ping.enabled: false
discovery.zen.ping.unicast.enabled: true
discovery.zen.ping.unicast.hosts: ["host1","host2","host3"]
bootstrap.mlockall: true
http.cors.enabled: true
index.number_of_shards: 10
action.disable_delete_all_indices: true
marvel.agent.exporter.es.hosts: ["marvel:9200"]
1 answer

The cluster hangs because of the fault detection timeouts. These are the settings I use:

discovery.zen.fd.ping_interval: 1s   # default: 1s
discovery.zen.fd.ping_timeout: 2s    # default: 30s
discovery.zen.fd.ping_retries: 3     # default: 3 (a retry count, not seconds)

There are two fault detection processes.

The first runs on the master, which pings all other nodes in the cluster and verifies that they are alive.

The second runs on every other node, which pings the master to verify that it is still alive or whether an election process needs to be started.

With the above configuration, if a node fails, the master retries 3 times with a timeout of 2 seconds each (about 6 seconds of hanging in total) instead of waiting 90 seconds with the defaults (3 retries at 30 seconds each).
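As a rough sanity check of that arithmetic, here is a minimal Python sketch. The function name is made up for illustration, and it assumes the hang is simply ping_timeout multiplied by ping_retries; it ignores ping_interval and any other overhead, so treat the result as an approximation rather than a guarantee.

```python
def detection_window_seconds(ping_timeout_s: float, ping_retries: int) -> float:
    """Approximate time before a dead node is declared failed.

    Assumption: every retry waits for the full timeout, so the
    window is roughly ping_timeout * ping_retries.
    """
    return ping_timeout_s * ping_retries


# Defaults quoted in this answer: 30s timeout, 3 retries.
defaults = detection_window_seconds(ping_timeout_s=30, ping_retries=3)

# Tuned values from the configuration above: 2s timeout, 3 retries.
tuned = detection_window_seconds(ping_timeout_s=2, ping_retries=3)

print(f"defaults: {defaults:.0f}s, tuned: {tuned:.0f}s")
```

Lowering ping_timeout shortens the hang, but on a slow or congested network it may also cause healthy nodes to be dropped, so pick the value based on the latency you actually observe.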

Please note that I am running the cluster on a local network with latency under 1 ms and 1 Gb/s of bandwidth. You will need to tune these values for your own environment. I am running Elasticsearch 5.1.1, so check the documentation for your version for the exact syntax.


Source: https://habr.com/ru/post/982724/

