High Availability (HA) / Duplex Deployment
For high availability, the solution is deployed on a three-node cluster. The Primary and Secondary nodes run the Docker images, while the Arbiter node participates in the election of the primary node (it does not serve application traffic).
Both the Primary and Secondary nodes run the full application stack, with the Secondary acting as a replica of the Primary. The cluster exposes a VIP (Virtual IP, managed via the VRRP protocol) which routes service requests to the active primary node at any point in time.
The two nodes are kept synchronized, with Keepalived managing the VIP. As soon as the Primary node goes down, the Secondary node takes over the VIP and resumes services, acting as the new primary. Failover from primary to secondary is therefore nearly seamless, provided that Keepalived is running.
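Since failover hinges on Keepalived moving the VIP, a configuration sketch may help. The following is a minimal, illustrative keepalived.conf for the Primary node; the interface name, router ID, priorities, shared secret, and VIP address are assumptions, not values from this guide, and must be adapted to your environment.

```
# Illustrative VRRP instance for the Primary node (all values are placeholders).
vrrp_instance CIM_VIP {
    state MASTER            # use BACKUP on the Secondary node
    interface eth0          # NIC carrying the service traffic (assumed name)
    virtual_router_id 51    # must be identical on both nodes
    priority 150            # Secondary uses a lower value, e.g. 100
    advert_int 1            # VRRP advertisement interval, in seconds
    authentication {
        auth_type PASS
        auth_pass s3cret    # shared secret, same on both nodes
    }
    virtual_ipaddress {
        192.168.1.100/24    # the cluster VIP exposed to clients (assumed)
    }
}
```

With this setup, whichever node currently holds the MASTER role answers on the VIP; if it stops sending VRRP advertisements, the BACKUP node claims the VIP within a few advertisement intervals.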
The Arbiter node does not replicate the data; its sole purpose is to vote in the election process when a primary node is being chosen.
Elections occur when the primary becomes unavailable. The remaining nodes in the replica set, i.e., the Arbiter and Secondary nodes, call for an election to select a new primary so that the cluster resumes normal operations.
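The terminology above (replica set, arbiter, primary election) matches MongoDB's replication model. Assuming the CIM datastore is a MongoDB replica set, a minimal sketch of the three-member topology could look like the following; the hostnames, port, and replica-set name are placeholders, not values from this guide.

```javascript
// Hypothetical three-member replica set: a data-bearing Primary and Secondary
// plus a non-data-bearing Arbiter that only votes in elections.
// Run once in mongosh against the node intended to become primary.
rs.initiate({
  _id: "cimReplSet",                                  // assumed replica-set name
  members: [
    { _id: 0, host: "primary.example.local:27017",   priority: 2 },
    { _id: 1, host: "secondary.example.local:27017", priority: 1 },
    { _id: 2, host: "arbiter.example.local:27017",   arbiterOnly: true }
  ]
});
```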
Fault Tolerance
This three-node cluster design keeps the system available if either the Primary or the Secondary is unavailable. If the Primary is unavailable, the Arbiter and the Secondary together form a voting majority and elect the Secondary as the new primary.
However, if both the Primary and the Arbiter are down, the Secondary alone cannot form a voting majority (two of the three votes), so it cannot be elected primary and will not serve read/write operations as one.
Note:
- Each node represents a VM.
- At any given point in time, at least two nodes should be up (see the health-check sketch below).
- For hardware resilience, the nodes/VMs should be spread across at least two physical servers. If all the VMs are deployed on the same physical server, the design provides only VM-level resilience and fault tolerance.
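As one way to verify the "at least two nodes up" condition, here is a hedged mongosh sketch (again assuming the datastore is a MongoDB replica set) that counts healthy voting members; with three voters, a majority of two is required to keep a primary elected.

```javascript
// Hypothetical health check: warns when the replica set has lost the voting
// majority needed to keep (or elect) a primary. Run from mongosh on any
// reachable member.
const members = rs.status().members;
const healthy = members.filter(m => m.health === 1).length;
if (healthy < 2) {
  print(`WARNING: only ${healthy} of ${members.length} members are healthy; ` +
        "no voting majority, so write operations will fail");
} else {
  print(`OK: ${healthy} of ${members.length} members are healthy`);
}
```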
Failover Scenarios

| Scenario | Impact | Recovery |
| --- | --- | --- |
| The Primary or Secondary node is down | The system switches to the other active node; all applications continue to work after the seamless failover. | NA |
| The Primary and Secondary nodes are down | All applications and services stop working. | Manual intervention is required to recover the system. |
| The primary or secondary CIM datastore is down | The Arbiter elects the datastore instance on the other node as primary; all read/write operations on the CIM datastore continue to function seamlessly. | NA |
| The primary and secondary CIM datastores are down | All read/write operations on the datastore fail. | Manual intervention is required to resume datastore operations. |
| The CIM primary datastore and the Arbiter are down | All write operations on the datastore fail. Read operations (if configured to read from the secondary DB) continue to work normally. | Manual intervention is required to resume datastore write operations. |
| The CIM secondary datastore and the Arbiter are down | All write operations on the datastore fail, since the surviving node cannot hold a voting majority. Read operations continue to work normally. | Manual intervention is required to resume datastore write operations. |
| The Arbiter is down | All read/write operations on the datastore continue to function seamlessly. | NA |