Striim 3.10.3 documentation

Avoiding split brain

In a multi-server Striim cluster with the metadata repository hosted on Oracle or PostgreSQL, a network partition can split the cluster into two subsets that cannot communicate with each other. Both subsets then go into failover mode (commonly called split brain), which causes unpredictable errors and eventually a crash.

To prevent this from happening, on each server:

  1. In the server's configuration, set ClusterQuorumSize to just over half the number of servers in the cluster, so that at most one side of a partition can hold a majority. For example, for a three-server cluster, set ClusterQuorumSize=2; for a four-server cluster, set ClusterQuorumSize=3.

  2. By default, when the number of servers in the cluster drops below the quorum, each server will wait 60 seconds for communication to resume before shutting down. To change that timeout, set ClusterHeartBeatTimeout to the desired number of seconds.

Then restart all servers (see Starting and stopping Striim).
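
The quorum arithmetic in step 1 can be checked with a short sketch. It assumes that "just over half" means the smallest strict majority (half the server count, rounded down, plus one), which matches both examples above. The function names quorum_size and has_quorum are hypothetical helpers for illustration and are not part of Striim.

```python
def quorum_size(server_count: int) -> int:
    """Return the smallest strict majority of server_count servers.

    Assumption: "just over half" means floor(n / 2) + 1.
    """
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    return server_count // 2 + 1


def has_quorum(reachable_servers: int, quorum: int) -> bool:
    """Return True if a partition with this many servers meets the quorum."""
    return reachable_servers >= quorum


if __name__ == "__main__":
    # Print the value to use for a few hypothetical cluster sizes.
    for n in (3, 4, 5):
        print(f"{n} servers: ClusterQuorumSize={quorum_size(n)}")
```

With a strict majority, at most one side of a partition can meet the quorum. In a four-server cluster split evenly into two pairs, neither side reaches a quorum of 3, so by the rule in step 2 every server would shut down after the timeout rather than both halves continuing independently.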