The primary mission of a disaster recovery plan is to restore critical business systems to a normal or near-normal condition following a disruptive incident. Popular strategies for recovering systems and other IT assets include redundant hardware components, access to spare parts and components for critical systems, diversely routed network infrastructures, and rapid failover and failback technologies that quickly switch from a primary to an alternate platform.
In this tutorial on high-availability disaster recovery planning, we’ll examine another popular technology disaster recovery technique, high availability (HA), and the role it plays in a disaster recovery (DR) plan. In today’s world of very short recovery time objectives (RTOs) and mission-critical systems, technology must be resilient in the face of potentially disastrous events. High-availability systems can help achieve very low or near-zero RTOs.
What is high availability?
Think of high availability as a system design or methodology that can satisfy specific—and contractual—uptime performance goals. Simply stated, high-availability solutions minimize downtime and maximize availability. Downtime can be scheduled (e.g., installing patches or updating common equipment) or unscheduled (e.g., power outages or component failures). Whatever the case, high-availability systems are designed and configured to remain operational during and after disruptive events.
We often think of availability in terms of the percentage of uptime during a typical year. Table 1 shows the familiar “table of nines” and the maximum downtime each availability level allows. The monthly figures assume a 30-day month.
Table 1: Application availability
|Availability %|Downtime per year|Downtime per month|Downtime per week|
|---|---|---|---|
|99.9% ("three nines")|8.76 hours|43.2 minutes|10.1 minutes|
|99.99% ("four nines")|52.56 minutes|4.32 minutes|1.01 minutes|
|99.999% ("five nines")|5.26 minutes|25.9 seconds|6.05 seconds|
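The figures in Table 1 follow from simple arithmetic: allowed downtime is the unavailable fraction multiplied by the length of the period. The short Python sketch below reproduces them. The function name `downtime_hours` and the `PERIOD_HOURS` mapping are illustrative names chosen here, and the period lengths (a 365-day year, a 30-day month) are assumptions that match the table's values.

```python
# Period lengths in hours. Assumptions: 365-day year, 30-day month, 7-day week.
PERIOD_HOURS = {"year": 365 * 24, "month": 30 * 24, "week": 7 * 24}


def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Return the maximum downtime, in hours, allowed over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * period_hours


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        cells = ", ".join(
            f"{name}: {downtime_hours(pct, hours) * 60:.2f} min"
            for name, hours in PERIOD_HOURS.items()
        )
        print(f"{pct}% -> {cells}")
```

The same function can be used to check whether a contractual uptime target is compatible with a planned maintenance window, for example by comparing the window length against `downtime_hours(target, PERIOD_HOURS["month"])`.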
Note that uptime and availability are not the same. A system can be “up” but “not available,” as happens during a network outage. Likewise, a server can power up, but an internal firmware glitch might make it unavailable to users.
Human intervention in systems is a major cause of downtime. Building redundancy into the component and software levels can greatly reduce the need for human intervention. Passive redundancy is achieved by building in additional operational capacity to absorb declines in performance. Active redundancy adds a level of intelligence to the system that senses performance changes and dynamically reconfigures the system to bypass failing components in real time. In many parts of the financial sector, for example, zero downtime is the design goal. This is achieved by increased component-level redundancy as well as more sophisticated failure management algorithms that spot tiny variations in performance levels and address them before they escalate into something disruptive.
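To make the idea of active redundancy concrete, here is a minimal Python sketch of a monitor that senses a performance change and switches to a standby without human intervention. It is an illustration under stated assumptions, not any vendor's implementation: `Node`, `FailoverMonitor` and the latency threshold are hypothetical names and values, and each health check is a plain function that returns a response time in milliseconds or raises an exception when the node is down.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    name: str
    # Hypothetical health check: returns response time in ms, raises if the node is down.
    check: Callable[[], float]


class FailoverMonitor:
    """Keeps one node active and fails over when its health check degrades."""

    def __init__(self, nodes: List[Node], max_latency_ms: float):
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = nodes
        self.max_latency_ms = max_latency_ms
        self.active = nodes[0]

    def is_healthy(self, node: Node) -> bool:
        try:
            return node.check() <= self.max_latency_ms
        except Exception:
            return False  # a failed check counts as unhealthy

    def poll(self) -> Node:
        """Run one monitoring cycle and return the node that is active afterwards."""
        if not self.is_healthy(self.active):
            for candidate in self.nodes:
                if candidate is not self.active and self.is_healthy(candidate):
                    self.active = candidate  # bypass the failing node
                    break
        return self.active


if __name__ == "__main__":
    samples = iter([20.0, 500.0])  # primary slows down on the second check
    primary = Node("primary", lambda: next(samples))
    standby = Node("standby", lambda: 25.0)
    monitor = FailoverMonitor([primary, standby], max_latency_ms=100.0)
    print(monitor.poll().name)
    print(monitor.poll().name)
```

In this sketch the monitor stays on the failing node if no standby is healthy, which is one possible design choice. A production system would also need failback logic and protection against rapid switching between nodes.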
Fault-tolerant systems and disaster recovery
A popular capability of a high-availability system is fault tolerance. This means the system can continue operating despite the failure or degradation of certain components. Fault-tolerant systems can isolate a failed component, contain a failure to limit its propagation, and revert to a previous state of operation, and they have no single point of failure. Fully normal operations may not be possible, but a near-normal level of operation may be sufficient.
Graceful degradation is another term that describes a fault-tolerant system, in that the system does not experience a hard failure or shutdown. Rather, the system is designed to decline in performance so as to minimize any disruptions to overall system operations. Service-level agreements (SLAs) can be used to define the levels of performance degradation, if needed.
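One way to picture graceful degradation is as a mapping from the share of healthy components to an agreed service mode. The Python sketch below shows that idea. The tier names and thresholds in `SLA_TIERS` are hypothetical and purely illustrative; in practice they would come from the SLA.

```python
# Hypothetical SLA tiers: (minimum fraction of healthy components, service mode),
# ordered from most to least capacity. Thresholds are illustrative only.
SLA_TIERS = [
    (0.75, "full"),
    (0.40, "degraded"),
]


def service_mode(healthy: int, total: int) -> str:
    """Pick a service mode from the fraction of components that are healthy."""
    if total <= 0 or not 0 <= healthy <= total:
        raise ValueError("total must be positive and 0 <= healthy <= total")
    ratio = healthy / total
    for minimum, mode in SLA_TIERS:
        if ratio >= minimum:
            return mode
    # Below the lowest tier: keep essential functions running while anything is left.
    return "essential-only" if healthy > 0 else "offline"


if __name__ == "__main__":
    for healthy in (8, 4, 1, 0):
        print(healthy, "of 8 healthy ->", service_mode(healthy, 8))
```

The point of the sketch is that the system steps down through defined levels instead of going straight from fully operational to a hard failure.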
The value of high availability to disaster recovery programs
If your business can survive occasional system disruptions, such as maintenance or even unplanned incidents, an HA solution may not be needed. However, if your business cannot afford any kind of downtime, you should consider an HA system. HA systems can increase the likelihood that your DR plan will succeed because their primary mission is to ensure that critical systems remain operational.
The lack of HA technology in a disaster recovery plan could have a major effect on an organization during a severe disruption, such as a fire that destroys a data center. But for most organizations, where some level of downtime is acceptable, high-availability systems are nice to have rather than essential, and the costs of HA systems must be carefully weighed against the benefits they provide.
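A rough way to frame the cost-versus-benefit analysis is to compare the downtime cost an HA system is expected to avoid against what the system costs to run. The Python sketch below does that under simplifying assumptions: downtime cost is linear in hours, availability figures are annual averages, and all dollar amounts in the example are invented for illustration. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year


def annual_downtime_cost(availability_pct: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at a given availability level."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR * cost_per_hour


def ha_net_benefit(
    current_pct: float,
    target_pct: float,
    cost_per_hour: float,
    ha_annual_cost: float,
) -> float:
    """Downtime cost avoided by moving to target_pct, minus the yearly HA cost.

    A positive result suggests the HA investment pays for itself on these
    assumptions; a negative result suggests it does not.
    """
    avoided = annual_downtime_cost(current_pct, cost_per_hour) - annual_downtime_cost(
        target_pct, cost_per_hour
    )
    return avoided - ha_annual_cost


if __name__ == "__main__":
    # Illustrative figures only: $10,000 per hour of downtime, $50,000 per year for HA.
    net = ha_net_benefit(99.9, 99.99, cost_per_hour=10_000, ha_annual_cost=50_000)
    print(f"Net annual benefit: ${net:,.0f}")
```

A real analysis would also account for factors this sketch ignores, such as reputational damage, contractual penalties and the one-time cost of implementation.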
Building high availability into a disaster recovery plan
In a properly designed IT disaster recovery plan, each hardware platform, operating system, application, database management system and network infrastructure component should have steps defined for recovery and restoration. In a disaster, those steps are executed to ensure that the system or software is recovered and returned to a normal or near-normal state.
High-availability systems are designed to require minimal human intervention, but can you assume that, in a disaster, the HA systems will be unaffected? It’s best to work with your HA vendor to identify and incorporate procedures in your DR plan that will ensure that your HA system remains operational. During disaster recovery exercises, be sure to include your HA systems if possible, but check with your HA vendor first to see what they recommend when you plan to conduct an exercise. Better yet, invite your HA vendor to attend the exercise.
When executing your DR plan’s system-level recovery procedures, the following high-level steps, as shown in Figure 1 below, should occur:
Whether you are using non-HA or full-HA systems, the above steps should occur as part of your DR plan procedures. Check with your vendors for special provisions that are unique to their systems and include them in your process-level system plans.
A sampling of high-availability vendors
A number of vendors offer high-availability products. Vision Solutions offers high-availability solutions that capture changes to data, applications and critical system objects in real time and replicate them to alternate (backup) servers. They enable controlled and non-intrusive switching, as well as failover that is fast and reliable. The company’s Double-Take systems are widely used for system failovers and failbacks.
Neverfail Group offers Continuous Availability, which protects data through continuous replication, monitors the health and state of business applications and enables automated failover. If a problem occurs, Neverfail invokes pre-emptive and corrective actions including fully coordinated failover of all components within the ecosystem.
Marathon Technologies offers everRun SplitSite to achieve high availability and disaster resilience by maintaining application availability between systems located in separate sites. The product supports geographic separation with synchronization links routed over a switched WAN. Sites can be split between data centers in different locations and at varying degrees of geographic separation.
Good technology disaster recovery plans simplify the process of recovering critical systems in the aftermath of a disaster. Keeping systems available and ready for use at all times is what high-availability solutions do best. If your business requires uninterrupted operations, a regularly exercised and properly documented DR plan that includes high-availability systems is the best solution.
About this author: Paul Kirvan, CISA, CSSP, FBCI, CBCP, has more than 20 years of experience in business continuity management as a consultant, author and educator. He has been directly involved with dozens of IT/telecom consulting and audit engagements ranging from governance program development, program exercising, execution and maintenance to RFP preparation and response. Kirvan currently works as an independent business continuity consultant/auditor and is the secretary of the Business Continuity Institute USA chapter. He can be reached at firstname.lastname@example.org.
This was first published in July 2011