High Availability (HA) / Duplex Deployment

High Availability in ECM is planned for an upcoming release. The following sections explain how this would work.

Database Fail-Over with MS SQL

A two-node SQL Server failover cluster is required. ExpertFlow applications connect to the MS SQL Server cluster as a SQL user in order to create the application database schema. The cluster should be reachable through a single SQL Server cluster VIP (Virtual IP).

ExpertFlow requires a SQL user with the db_owner database role on each EF application database.
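As a rough illustration of connecting through the cluster VIP rather than through an individual node, the sketch below builds an ODBC-style connection string. The VIP address, port, database name, user name, password, and driver name are all placeholders chosen for this example; they are not values from an actual ECM deployment.

```python
def build_connection_string(vip: str, database: str, user: str, password: str,
                            driver: str = "ODBC Driver 17 for SQL Server",
                            port: int = 1433) -> str:
    """Build an ODBC connection string that targets the SQL Server cluster VIP."""
    parts = {
        "DRIVER": "{" + driver + "}",
        "SERVER": f"{vip},{port}",   # the cluster VIP, never a single node's address
        "DATABASE": database,
        "UID": user,                 # the SQL user that holds db_owner on this database
        "PWD": password,
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


# Placeholder values for illustration only.
conn_str = build_connection_string("10.0.0.50", "ef_app_db", "ef_user", "change-me")
print(conn_str)
```

Because the application only ever knows the VIP, a database failover between the two SQL Server nodes does not require any change to this string.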

Application Fail-Over

EF applications are deployed on a two-node cluster of VMs, with one node acting as the primary and the other as the secondary, to provide fault tolerance and high availability. Both nodes run the full set of components, so each is a replica of the other. The cluster exposes a VIP (Virtual IP, managed with VRRP, the Virtual Router Redundancy Protocol) that routes service requests to whichever node is the active primary at that point in time.

The two nodes run Keepalived, so that as soon as the primary node goes down, the secondary node takes over its services and acts as the primary. Failover from primary to secondary is therefore nearly seamless.
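The failover behavior described above can be modeled in a few lines. This is a simplified sketch of VRRP-style selection, not Keepalived itself: the node names and priority values are hypothetical, and the model assumes that the healthy node with the highest priority holds the VIP.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    priority: int          # higher priority wins, as in VRRP
    healthy: bool = True


def vip_owner(nodes: List[Node]) -> Optional[Node]:
    """Return the healthy node with the highest priority, or None if all are down."""
    candidates = [node for node in nodes if node.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda node: node.priority)


primary = Node("node-1", priority=150)
secondary = Node("node-2", priority=100)
cluster = [primary, secondary]

print(vip_owner(cluster).name)   # node-1 while both nodes are healthy
primary.healthy = False
print(vip_owner(cluster).name)   # node-2 after the primary fails
secondary.healthy = False
print(vip_owner(cluster))        # None: both nodes are down
```

Whether the original primary reclaims the VIP once it recovers depends on how preemption is configured in Keepalived, which is not specified here.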

The overall architecture will look like below.

Note

Each Node represents a VM.

For hardware resilience, these nodes/VMs should be hosted on two separate physical servers. If all the VMs are deployed on the same physical server, the deployment provides only VM-level resilience and fault tolerance.

Fail-Over Scenarios

When the primary or secondary node is down
The system switches to the other active node.
Impact		All services continue to work after the failover.
Recovery		N/A

When both the primary and secondary nodes are down
Impact		No services will be available.
Recovery		Manual intervention is required to recover the system.

When the primary or secondary application instance is down
The other active instance becomes aware of the failure through Keepalived and makes itself primary.
Impact		All read/write operations on the application continue to function after the failover. With the MS SQL Server failover cluster, stored data remains safe. However, if the instance goes down while a request is being processed (for example, while uploading data to the Cisco campaign), that data might be lost and a re-import/upload might be required.
Recovery		For the integration between the IVR and the web app, the RESTful APIs used in the IVR scripts fail over seamlessly to the secondary instance to upload the callback request. Failover is also seamless for web users, who reach the application through a virtual web portal IP.
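Since a request that is in flight when an instance goes down may be lost, a client can retry it once the VIP points at the surviving instance. The sketch below shows one hedged way to do that; the function name, the retry count, and the delay are assumptions made for illustration, and the `send` callable stands in for whatever upload or callback request the client actually issues.

```python
import time
from typing import Callable


def upload_with_retry(send: Callable[[], bool], attempts: int = 3,
                      delay_seconds: float = 1.0) -> bool:
    """Call send() until it succeeds or the attempts are exhausted.

    send() is expected to return True on success and to raise ConnectionError
    when the request is interrupted, for example during a failover.
    """
    for attempt in range(1, attempts + 1):
        try:
            if send():
                return True
        except ConnectionError:
            pass  # the VIP may be moving to the other instance; try again
        if attempt < attempts:
            time.sleep(delay_seconds)
    return False


# Simulated upload: the first call fails mid-failover, the second succeeds.
calls = {"count": 0}


def fake_upload() -> bool:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("instance went down during the upload")
    return True


print(upload_with_retry(fake_upload, delay_seconds=0.0))
```

Retrying is only safe when repeating the upload cannot create duplicates, so a real client would need either an idempotent endpoint or a duplicate check on the server side.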

When both the primary and the secondary application instances are down
Impact		All read/write operations on the datastore will fail.
Recovery		Manual intervention is required to resume the datastore operations.

When the primary or secondary “Contact Feed” component is down
The active instance on the other node becomes aware of the failure through the Heartbeat mechanism and makes itself primary.
Impact		All read/write operations on the application continue to function after the failover. However, if the component fails while data is being processed or uploaded (for example, when a failover occurs after contacts have been fed to the Cisco dialer but before they are marked as “fed to dialer”), this may cause data inconsistency. No data will be lost.
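The inconsistency described for the “Contact Feed” component arises when a contact has reached the dialer but has not yet been marked as “fed to dialer”. The sketch below shows one possible reconciliation check after a failover. It is an assumption-laden illustration: the `id` and `fed_to_dialer` field names and the idea of comparing against a set of IDs known to the dialer are hypothetical and not taken from the ECM schema.

```python
from typing import Dict, Iterable, List, Set


def contacts_to_mark(contacts: Iterable[Dict], dialer_ids: Set[str]) -> List[str]:
    """IDs already present in the dialer but not yet marked as fed."""
    return [c["id"] for c in contacts
            if not c["fed_to_dialer"] and c["id"] in dialer_ids]


def contacts_to_refeed(contacts: Iterable[Dict], dialer_ids: Set[str]) -> List[str]:
    """IDs that are neither marked as fed nor present in the dialer."""
    return [c["id"] for c in contacts
            if not c["fed_to_dialer"] and c["id"] not in dialer_ids]


# Hypothetical state after a failover: c2 reached the dialer but was never marked.
contacts = [
    {"id": "c1", "fed_to_dialer": True},
    {"id": "c2", "fed_to_dialer": False},
    {"id": "c3", "fed_to_dialer": False},
]
dialer_ids = {"c1", "c2"}

print(contacts_to_mark(contacts, dialer_ids))    # ['c2']
print(contacts_to_refeed(contacts, dialer_ids))  # ['c3']
```

Marking `c2` instead of feeding it again avoids a duplicate dialer record, while `c3` is fed as usual, which is consistent with the statement that no data is lost.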

When both the primary and secondary “Contact Feed” components are down
Impact		All read/write operations on the datastore will fail.
Recovery		Manual intervention is required to resume the datastore operations.

When the primary or secondary “Sync Results” component is down
The active instance on the other node becomes aware of the failure through the Heartbeat mechanism and makes itself primary.
Impact		All read/write operations on the application continue to function after the failover. However, if the component fails while a request is being processed (for example, while updating the database tables), this may cause some data inconsistency. No data will be lost.

When both the primary and secondary “Sync Results” components are down
Impact		All read/write operations on the datastore will fail.
Recovery		Manual intervention is required to resume the datastore operations.