High Availability Introduction

Enterprise applications like Activate must be continuously available, with downtime minimised, to maintain business operations. This article outlines key strategies and recommendations for implementing a high availability (HA) architecture for Activate.

HA architecture for Activate

Design 1: Standard Activate architecture

This design emphasises failover and disaster recovery using secondary data centres or regions, log shipping, and snapshot replication. The architecture features a manual failover process with an RPO of 1 to 4 hours, limiting data loss during disaster recovery. This configuration provides a scalable infrastructure, offering secure access for external clients and third-party integrations while ensuring high availability and reliable disaster recovery capabilities.

Design 2: Alternative three-tier architecture

This design is a three-tier high availability setup in which external and internal client requests are managed through load balancers, secure HTTPS connections, and a disaster recovery mechanism. The architecture handles large-scale operations with redundancy at the application and database tiers, using SQL Always On availability groups and automatic failover for minimal downtime.

The key difference between Design 1 and Design 2 is that in Design 1 the Orchestrator and web services are on the same server.

Best practices for HA in Activate

Database resilience

All configuration and data for Activate are stored in a SQL database. Implement Always On availability groups with SQL Server to enable synchronous replication across multiple replicas for immediate failover. Schedule regular snapshot backups and test failover procedures to ensure business continuity during a disaster.

Disaster recovery plan (DRP)

Your DRP should include an RTO/RPO matrix that sets out the recovery time frames and acceptable data loss for different failure levels. Consider geo-redundant backups and failover sites to ensure business continuity even during large-scale disasters.

Zero trust architecture

Use a zero trust model across your infrastructure, where every access request is validated. Implement TLS encryption for all traffic to prevent security breaches and data compromise.

Load balancing and scaling

Use load balancers to distribute traffic across multiple application and database tier instances, minimising the risk of bottlenecks and improving fault tolerance.

Monitoring and alerts

Set up real-time monitoring of database, application, and network services. Automated alerts should trigger scaling and failover mechanisms if CPU, memory, or I/O thresholds are breached.

HA benefits of Activate's three-tier architecture

High availability (HA) is a critical strategy for ensuring that business applications remain operational with minimal downtime. Activate's three-tier architecture (the presentation layer, business logic layer, and data layer) is designed to maximise uptime, improve scalability, and enhance maintainability. The benefits of this architecture in the context of HA are described below. This separation of responsibilities improves fault tolerance and simplifies troubleshooting and maintenance during system disruption.

Fault tolerance

In a high availability setup, the three-tier architecture allows each layer to handle faults independently. For example, if the web server or application server in the business logic layer fails, the job service and web services can continue to operate. This enables queued processes to proceed once the affected server is restored. Database operations can continue in the data layer, using SQL clustering or Always On availability groups to ensure that data is not lost even during a server failure.

Additionally, distributing the workload across tiers improves overall performance. Load balancing ensures that spikes in traffic are handled without bottlenecking the system, while the job service processes automated tasks in the background to prevent delays for end users.
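The queue-and-resume behaviour described under fault tolerance can be illustrated with a minimal sketch. It is not Activate's implementation: the `RequestQueue` class, its methods, and the request names are hypothetical, and an in-memory `deque` stands in for requests that Activate holds in its database. The sketch only shows the general idea that requests accepted during an outage are kept and then processed in submission order once the service returns.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RequestQueue:
    """Hypothetical stand-in for requests queued while a service is down."""

    service_up: bool = True
    pending: deque = field(default_factory=deque)
    processed: list = field(default_factory=list)

    def submit(self, request: str) -> None:
        # Requests are always accepted; they are only processed while the service is up.
        self.pending.append(request)
        if self.service_up:
            self.drain()

    def drain(self) -> None:
        # Process queued requests oldest first (FIFO), preserving submission order.
        while self.pending:
            self.processed.append(self.pending.popleft())

    def restore(self) -> None:
        # Mark the service as restored and work through the backlog.
        self.service_up = True
        self.drain()


queue = RequestQueue()
queue.submit("create-user")       # processed immediately
queue.service_up = False          # simulate an outage
queue.submit("reset-password")    # held in the queue
queue.submit("disable-account")   # held in the queue
queue.restore()                   # backlog is processed in order
print(queue.processed)
```

After `restore()` is called, `queue.processed` holds the three requests in the order they were submitted and `queue.pending` is empty.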
Minimised impact of updates and maintenance

The modular structure of the three-tier architecture allows maintenance and updates to be performed on one layer without affecting the others. Business logic updates can be deployed without impacting the user interface or data storage. Database updates or optimisations can occur in the data layer without causing downtime for users interacting with the web portal or other application services. This allows for more frequent system updates and improvements, enhancing the overall user experience while maintaining high availability.

Disaster scenarios

Scenario 1: Temporary Activate Orchestrator outage

The Activate Orchestrator service experiences an outage that takes only 1 hour to restore to normal functionality.

Recovery steps

Activate recovers automatically, with mechanisms in place to mitigate service interruptions:

- Activate web service requests queue up in the Activate database, ready to be processed once the Orchestrator service is restored.
- Once the Orchestrator service is restored, queued requests are processed in the order they were submitted. This ensures continuity with minimal manual effort required for synchronisation or recovery.
- Activate automatically syncs with Active Directory (AD), Azure AD, and other managed systems. This automatic synchronisation ensures that any changes made during the downtime are reconciled without manual intervention.

Scenario 2: Complete outage on web, Orchestrator or database servers

The web, Orchestrator and/or database services become completely unavailable.

Recovery steps

A complete switchover to the DR environment is undertaken:

1. Update DNS to route to the DR web server.
2. Enable the DR servers to take over the roles of the primary servers:
   - On the Orchestrator DR server, enable and start the Activate Orchestrator Windows service.
   - On the web DR server, enable and start the Activate web service.
   - The DR database server, as a replicated copy, takes over where the primary left off.

Scenario 3: Catastrophic failure requiring restoration from database backup

A catastrophic database failure occurs between 5am and 11am, with a backup created at 5am.

Recovery steps

1. The 5am backup is restored to the primary (or DR) database server.
2. Any orphaned workflows (e.g. approval processes) that were active but not captured in the 5am backup may need to be re-initiated manually after restoration.
3. After the SQL database is restored, Activate syncs to its most recent state. Any direct changes made in Active Directory or other systems during the downtime are automatically reconciled upon system recovery.

Conclusion

Activate's three-tier architecture delivers high availability, scalability, and maintainability by separating layers for improved fault tolerance and disaster recovery. Strategic planning focused on reducing downtime (RTO) and minimising data loss (RPO), aligned to your business continuity plan (BCP), ensures that Activate can handle unexpected disruptions effectively. Applying best practices such as Always On availability groups, load balancers, and zero trust architecture keeps your system resilient and helps it meet your business's critical RTO and RPO targets.

Additional information

RTO, RPO and BCP alignment

When designing a high availability architecture, it is crucial to define the recovery time objective (RTO) and recovery point objective (RPO) as part of the business continuity plan (BCP). These metrics determine how quickly you must recover and how much data you can afford to lose in the event of a disaster or failure.

Recovery time objective (RTO)

RTO specifies the maximum acceptable downtime. Depending on the redundancy, failover, and replication setup, the RTO of a well-architected Activate HA solution should typically range from minutes to a few hours. Shorter RTOs require advanced failover mechanisms, such as automated database switchover and real-time monitoring of application tiers.

Example: Primary to secondary data centre failover
An RTO of 15 to 30 minutes can be achieved using automatic failover mechanisms in SQL Always On availability groups and active-active setups in the application tier.

Recovery point objective (RPO)

RPO defines the maximum allowable data loss. In Activate's HA architecture, using SQL Always On availability groups with synchronous replication keeps data loss minimal to none, ideally achieving an RPO close to zero. However, RPO could be measured in minutes to hours in asynchronous replication setups (e.g. log shipping).

Example: Asynchronous replication to DR site
Where data replication is asynchronous, RPO could be as high as 5 to 10 minutes. This suits disaster recovery for non-critical operations that can tolerate slight data loss.

Key considerations for BCP alignment

- Critical business functions: Identify mission-critical applications and determine the acceptable downtime and data loss levels for those systems.
- Redundancy strategy: Implement database mirroring and replication strategies that balance cost with the required levels of data availability.
- Regular failover testing: Regularly test failover and failback procedures to ensure they meet BCP requirements.
- Data retention policies: Ensure that data backups and retention policies align with organisational recovery needs and legal requirements.

Summary table

This table provides a quick reference for downtime, RTO, and RPO based on availability percentages for systems requiring high availability.

| Availability | Downtime per year | RTO (recovery time objective) | RPO (recovery point objective) |
|--------------|-------------------|-------------------------------|--------------------------------|
| 99.99%       | 52.56 minutes     | 5-10 minutes                  | 5-10 minutes                   |
| 99.95%       | 4.38 hours        | 1-4 hours                     | 4-5 hours                      |
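The downtime-per-year figures in the summary table follow directly from the availability percentage: the unavailable fraction multiplied by the length of a year. The short sketch below shows that arithmetic. The function name `annual_downtime_minutes` is illustrative only, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return unavailable_fraction * MINUTES_PER_YEAR


for target in (99.99, 99.95):
    minutes = annual_downtime_minutes(target)
    print(f"{target}%: {minutes:.2f} minutes ({minutes / 60:.2f} hours) per year")
```

For 99.99% this gives 52.56 minutes per year, and for 99.95% it gives 262.80 minutes, or 4.38 hours, matching the table. The RTO and RPO columns cannot be derived this way; they are targets chosen as part of the business continuity plan.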