High Availability (HA) / Duplex Deployment
High Availability in ECM is planned for an upcoming release. The following sections explain how this would work.
Database Fail-Over with MS SQL
A two-node SQL Server deployment must be set up in failover cluster mode. ExpertFlow applications will connect to the MS SQL Server cluster as a SQL user to create the application database schema. The cluster must be reachable through its unique SQL Server cluster VIP (Virtual IP).
ExpertFlow requires a SQL user that holds the database role db_owner on each EF application database.
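As an illustration of connecting through the cluster VIP rather than an individual node, the following Python sketch builds an ODBC connection string and retries while a failover may be in progress. The VIP address, database name, user name, ODBC driver name, and retry settings are placeholder assumptions, not values from an ECM deployment, and the helper names are hypothetical.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def build_connection_string(vip: str, database: str, user: str,
                            password: str, port: int = 1433) -> str:
    """Build an ODBC connection string that targets the cluster VIP, not a node."""
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"  # assumed driver name
        f"SERVER={vip},{port};"
        f"DATABASE={database};"
        f"UID={user};PWD={password}"
    )


def connect_with_retry(connect: Callable[[], T], attempts: int = 5,
                       delay_seconds: float = 2.0,
                       retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,)) -> T:
    """Call connect() until it succeeds; re-raise the last error when attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)  # give the cluster time to move the VIP
    raise AssertionError("unreachable")


# Placeholder values for illustration only.
conn_str = build_connection_string("10.0.0.50", "ef_app_db", "ef_sql_user", "change-me")
```

With a driver such as pyodbc, `connect` could be `lambda: pyodbc.connect(conn_str)` and `retry_on` could be the driver's error class (`pyodbc.Error`). Because the string names only the VIP, the application does not need to know which SQL Server node is currently active.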
Application Fail Over
EF applications are deployed on a two-node cluster (two VMs), with one node acting as the primary and the other as the secondary, to provide fault tolerance and high availability. Both nodes run the full set of components, so each is a replica of the other. The cluster exposes a VIP (Virtual IP, managed through the VRRP protocol) that routes service requests to whichever node is the active primary at that point in time.
The two nodes are synchronized and run Keepalived, so that as soon as the primary node goes down, the secondary node takes over the services and acts as the primary. Failover from primary to secondary is therefore nearly seamless.
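The VIP behavior described above can be modeled in a few lines of Python. This is a simplified sketch of a VRRP-style election (the healthy node with the highest priority holds the VIP), not Keepalived's actual implementation. The node names and priority values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    priority: int        # higher priority wins the election, as in VRRP
    healthy: bool = True


def vip_owner(nodes: List[Node]) -> Optional[Node]:
    """Return the healthy node with the highest priority, or None if all are down."""
    healthy = [node for node in nodes if node.healthy]
    if not healthy:
        return None
    return max(healthy, key=lambda node: node.priority)


primary = Node("node-a", priority=150)
secondary = Node("node-b", priority=100)
cluster = [primary, secondary]

owner_before = vip_owner(cluster)   # node-a holds the VIP
primary.healthy = False
owner_after = vip_owner(cluster)    # node-b takes over the VIP
```

In this model the higher-priority node reclaims the VIP as soon as it is healthy again. Whether a real deployment does the same depends on how Keepalived is configured for it.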
The overall architecture will look like below.
Fail Over scenarios
When the Primary or Secondary node is down
The system will switch to the other active node.
Impact | Everything will continue to work after the seamless failover.
Recovery | N/A
When both the Primary and Secondary nodes are down
Impact | No services will be available.
Recovery | Manual intervention will be required to recover the system.
When the primary or secondary Application instance is down
The other active instance will make itself primary once it detects the failure of the primary instance through Keepalived.
Impact | All read/write operations on the application will continue to function after the failover. With the MS SQL Server failover cluster, the data already stored will remain safe. However, if the instance goes down while a request is being processed (for example, while uploading data to the Cisco campaign), the data in that request might be lost and a re-import/upload might be required.
Recovery | For the integration between the IVR and the web app, the RESTful APIs used in the IVR scripts fail over seamlessly to the secondary instance to upload the callback request. Failover for web users will also be seamless through a virtual web portal IP.
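The re-import/upload mentioned above can be made cheaper if the uploader remembers which records were already confirmed, so that only the interrupted remainder is sent again. The following Python sketch shows that idea under stated assumptions: the record layout, the `send` callable, and the `upload_pending` helper are hypothetical, and a real uploader would need to persist the confirmed IDs (for example in the application database) rather than keep them in memory.

```python
from typing import Callable, Dict, Iterable, List, Set


def upload_pending(records: Iterable[Dict[str, str]],
                   confirmed_ids: Set[str],
                   send: Callable[[Dict[str, str]], None]) -> List[str]:
    """Send every record that is not yet confirmed; return the IDs sent in this run."""
    sent_now: List[str] = []
    for record in records:
        record_id = record["id"]
        if record_id in confirmed_ids:
            continue                   # confirmed before the failover, skip it
        send(record)                   # may raise if the instance goes down
        confirmed_ids.add(record_id)   # confirm only after send() has returned
        sent_now.append(record_id)
    return sent_now
```

If `send` raises partway through, the records confirmed so far stay in `confirmed_ids`, and calling `upload_pending` again with the same list sends only the rest.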
When both the primary and the secondary application instances are down
Impact | All read/write operations on the datastore will fail.
Recovery | Manual intervention is required to resume the datastore operations.
When the primary or secondary “Contact Feed” component is down
Impact | The active instance on the other node will make itself primary once it detects the failure of the other instance through the Heartbeat mechanism.
Recovery | All read/write operations on the application will continue to function after the failover. However, if this component fails while data is being processed or uploaded (for example, when contacts are being fed to the Cisco dialer and a failover occurs before the contacts are marked as “fed to dialer”), this may cause data inconsistency issues. No data will be lost.
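The inconsistency described above (contacts already in the dialer but not yet marked as “fed to dialer”) could be repaired by a reconciliation pass after failover. The Python sketch below is one possible approach, not a documented ECM feature. The status values, the `reconcile_fed_contacts` helper, and the way the set of contact IDs held by the dialer is obtained are all assumptions.

```python
from typing import Dict, List, Set


def reconcile_fed_contacts(contact_status: Dict[str, str],
                           ids_in_dialer: Set[str]) -> List[str]:
    """Mark contacts the dialer already holds but the database still lists as pending."""
    repaired: List[str] = []
    for contact_id, status in contact_status.items():
        if status == "pending" and contact_id in ids_in_dialer:
            contact_status[contact_id] = "fed to dialer"  # bring the record in line
            repaired.append(contact_id)
    return sorted(repaired)
```

Contacts that are still pending and unknown to the dialer are left untouched, so the new primary instance can feed them normally.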
When both the primary and secondary “Contact Feed” components are down
Impact | All read/write operations on the datastore will fail.
Recovery | Manual intervention is required to resume the datastore operations.
When the primary or secondary “Sync Results” component is down
Impact | The active instance on the other node will make itself primary once it detects the failure of the other instance through the Heartbeat mechanism.
Recovery | All read/write operations on the application will continue to function after the failover. However, if the component fails while a request is being processed (for example, while updating the database tables), this may cause some data inconsistency issues. No data will be lost.
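One common way to limit inconsistency from a failure in the middle of a database update is to group related writes into a single transaction, so that either all of them are stored or none are. The source does not state how “Sync Results” writes its updates, so the following Python sketch is only an assumption-laden illustration. It uses an in-memory SQLite database as a stand-in for MS SQL Server, and the table name, column names, and `apply_results` helper are hypothetical.

```python
import sqlite3
from typing import List, Optional, Tuple


def apply_results(conn: sqlite3.Connection,
                  results: List[Tuple[str, str]],
                  fail_after: Optional[int] = None) -> bool:
    """Write all results in one transaction; roll everything back on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for index, (contact_id, outcome) in enumerate(results):
                if fail_after is not None and index == fail_after:
                    raise RuntimeError("simulated component failure")
                conn.execute(
                    "UPDATE call_results SET outcome = ? WHERE contact_id = ?",
                    (outcome, contact_id),
                )
    except RuntimeError:
        return False
    return True


# Stand-in database with two rows awaiting results.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE call_results (contact_id TEXT PRIMARY KEY, outcome TEXT)")
demo.executemany("INSERT INTO call_results VALUES (?, ?)",
                 [("c1", "unknown"), ("c2", "unknown")])
demo.commit()
```

Calling `apply_results(demo, [("c1", "answered"), ("c2", "busy")], fail_after=1)` leaves both rows unchanged, because the first update is rolled back together with the failed batch. The batch can then be replayed in full by the instance that takes over.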
When both the primary and secondary “Sync Results” components are down
Impact | All read/write operations on the datastore will fail.
Recovery | Manual intervention is required to resume the datastore operations.