Avoiding split brain

Note

Striim Cloud Mission Critical is not subject to split-brain.

In a multi-server Striim cluster whose metadata repository is hosted on Oracle or PostgreSQL, a network partition can split the cluster into two subsets that cannot communicate with each other. Both subsets then go into failover mode (a condition commonly called split brain), which results in unpredictable errors and eventually a crash.

To prevent this from happening, on each server:

  1. Edit startUp.properties and set ClusterQuorumSize to just over half the number of servers in the cluster, that is, the smallest whole number greater than half. For example, for a two-server or three-server cluster, set ClusterQuorumSize=2; for a four-server cluster, set ClusterQuorumSize=3.

  2. Optionally, set ClusterHeartBeatTimeout to the number of seconds a server may remain unresponsive before it is no longer considered part of the cluster (the default is 60). Striim pings each server in the cluster once a second to verify that it is online. When the number of servers in the cluster drops below ClusterQuorumSize, all remaining servers shut down.

Then restart all servers (see Starting and stopping Striim Platform).
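
The quorum rule above can be sketched in a few lines of Python. This is only an illustration, not part of Striim: the function names quorum_size and sides_with_quorum are hypothetical, and the sketch assumes that "just over half" means the smallest whole number strictly greater than half the server count, which matches the two-, three-, and four-server examples given above.

```python
def quorum_size(server_count):
    """Smallest whole number strictly greater than half of server_count.

    Assumption: this is what "just over half" means in the text above.
    """
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    return server_count // 2 + 1


def sides_with_quorum(server_count, side_a_count):
    """Return the sides of a two-way partition that still meet the quorum.

    side_a_count servers end up on side "A"; the rest end up on side "B".
    """
    if not 0 <= side_a_count <= server_count:
        raise ValueError("side_a_count must be between 0 and server_count")
    quorum = quorum_size(server_count)
    sides = (("A", side_a_count), ("B", server_count - side_a_count))
    return [name for name, size in sides if size >= quorum]


# Print the value to use for a few cluster sizes.
for n in (2, 3, 4, 5):
    print(f"{n} servers: ClusterQuorumSize={quorum_size(n)}")
```

Because the quorum is more than half the cluster, two disconnected subsets can never both reach it. In this sketch, sides_with_quorum(3, 2) returns only side A, while sides_with_quorum(4, 2) returns an empty list: an even split leaves neither side with a quorum, so all servers shut down instead of running as two independent clusters.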