Avoiding split brain
Note
Striim Cloud Mission Critical is not subject to split-brain.
In a multi-server Striim cluster with the metadata repository hosted on Oracle or PostgreSQL, a network partition may split the cluster into two subsets that cannot communicate with each other. When that happens, both subsets go into failover mode (commonly called split brain), resulting in unpredictable errors and eventually a crash.
To prevent this from happening, on each server:
Edit startUp.properties and set ClusterQuorumSize to just over half the number of servers in the cluster. For example, for a two-server or three-server cluster, set ClusterQuorumSize=2; for a four-server cluster, set ClusterQuorumSize=3.
Once a second, Striim pings each server in the cluster to verify that it is online. If a server does not respond for 60 seconds, it is no longer considered part of the cluster. (To change this, set ClusterHeartBeatTimeout to the desired number of seconds.) When the number of servers in the cluster drops below the quorum, all remaining servers will shut down.
Then restart all servers (see Starting and stopping Striim Platform).
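The quorum arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of Striim: the function names quorum_size and stays_up are hypothetical, and the formula (integer half of the server count, plus one) is one reading of "just over half" that reproduces the three examples given above.

```python
def quorum_size(server_count: int) -> int:
    """Return the smallest whole number strictly greater than half of server_count.

    Hypothetical helper for choosing a ClusterQuorumSize value; not a Striim API.
    """
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    return server_count // 2 + 1


def stays_up(reachable_servers: int, quorum: int) -> bool:
    """A subset keeps running only while it still meets the quorum."""
    return reachable_servers >= quorum


if __name__ == "__main__":
    # Suggested setting for the cluster sizes mentioned in the text.
    for n in (2, 3, 4):
        print(f"{n} servers -> ClusterQuorumSize={quorum_size(n)}")

    q = quorum_size(4)
    # Four servers split 2/2: neither side meets the quorum, so both shut down.
    print([stays_up(side, q) for side in (2, 2)])
    # Four servers split 3/1: only the three-server side keeps running.
    print([stays_up(side, q) for side in (3, 1)])
```

Because the quorum is always more than half the cluster, two disconnected subsets can never both meet it, so at most one side of a partition keeps running.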