One of the fundamental prerequisites of successful disaster recovery (DR) planning is understanding the real requirements. What does the business need, and can the organization meet that need in terms of both capability and cost? The key metrics here are recovery time objective (RTO) and recovery point objective (RPO). Briefly, RTO is the maximum acceptable time to resume operations (not merely to recover data), and RPO is the maximum acceptable amount of data loss, usually expressed as a span of time.
The failure to understand and agree upon these metrics for critical applications, and the subsequent inability to invest in and develop capabilities to support them, is the basis for the disaster recovery gap between business and IT. Bridging this gap requires IT to meet with business and application owners to understand recovery needs so that the financial impact of outages can be quantified and then weighed against the cost of providing the necessary service level. This may require some negotiation, but the unavoidable truth is that without this conversation, DR success is impossible.
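To make the weighing of outage impact against service-level cost concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `RecoveryOption` class, the cost parameters and the dollar figures are hypothetical, and the model is deliberately crude (it assumes every outage lasts the full RTO and loses the full RPO). Real figures would come from the conversation with business and application owners.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    """A hypothetical DR service level and what it costs to provide."""
    name: str
    annual_cost: float   # yearly cost of providing this capability
    rto_hours: float     # time to resume operations under this option
    rpo_hours: float     # window of data that may be lost under this option


def expected_annual_outage_cost(option, downtime_cost_per_hour,
                                data_loss_cost_per_hour, outages_per_year):
    """Worst-case outage impact per year: full RTO and full RPO on every outage."""
    per_outage = (option.rto_hours * downtime_cost_per_hour
                  + option.rpo_hours * data_loss_cost_per_hour)
    return per_outage * outages_per_year


def total_annual_cost(option, downtime_cost_per_hour,
                      data_loss_cost_per_hour, outages_per_year):
    """Cost of the capability plus the outage impact it still leaves exposed."""
    return option.annual_cost + expected_annual_outage_cost(
        option, downtime_cost_per_hour, data_loss_cost_per_hour, outages_per_year)


def cheapest_option(options, downtime_cost_per_hour,
                    data_loss_cost_per_hour, outages_per_year):
    """Pick the option with the lowest combined cost."""
    return min(options, key=lambda o: total_annual_cost(
        o, downtime_cost_per_hour, data_loss_cost_per_hour, outages_per_year))


# Illustrative figures only; none of these numbers come from the article.
tape = RecoveryOption("tape restore", annual_cost=50_000, rto_hours=48, rpo_hours=24)
replication = RecoveryOption("replication", annual_cost=400_000, rto_hours=4, rpo_hours=0.25)

choice = cheapest_option([tape, replication],
                         downtime_cost_per_hour=20_000,
                         data_loss_cost_per_hour=5_000,
                         outages_per_year=0.2)
print(choice.name)
```

The point of the sketch is the shape of the comparison rather than the numbers: the preferred option shifts as the assumed outage frequency or hourly impact changes, which is why those inputs have to be agreed with the business rather than guessed by IT.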
Building this capability goes well beyond a technology exercise. It consists of planning, identifying dependencies, developing processes and, above all, testing.
If you fail to plan, you plan to fail
A disaster recovery plan is an organization's detailed roadmap of where to go, what to do and when to do it in the event of a disaster. It should cover the actions to be performed before, during and after a disaster is declared. Among the more basic elements are the criteria under which a disaster is declared, who can declare it and how individuals are notified. Past experience with hurricanes has reinforced both the difficulty and the importance of communications, so a good plan should include contingencies: you can't assume your email will work, or even that cell phone service will be available.
We know that processes and procedures need to be documented, but we also know that most people hate to do it. Even the most carefully crafted disaster recovery plan will become useless if it is not maintained. Disaster recovery needs to be baked into the standard change management process so that whenever systems are modified, software is patched or additional storage is assigned, the impact on the DR plan is reviewed and the plan is revised accordingly. Likewise, when reorganizations occur, the disaster recovery plan must be revisited.
It's clear that double-digit data growth rates dramatically impact the ability to recover data within targeted time constraints, but application complexity and interdependence are often-overlooked factors that have a major impact on recoverability. Today, major applications are spread across multiple servers and architectures. It's not uncommon for a mainframe application to feed other applications or subcomponents that reside on Unix or Windows platforms. With the traditional server-centric approach to recovery, it's possible to successfully back up or snapshot each application component and still be unable to fully recover the application because the components are inconsistent with one another.
This situation can be avoided by first understanding the interdependencies among applications and then applying the appropriate data protection approach. The method could be the use of split mirror/replication technology featuring consistency groups that encompass the interdependent elements, or it might be continuous data protection (CDP) technology that can ensure highly granular, synchronized time-based rollback.
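One way to picture the first step, understanding interdependencies, is to treat applications as nodes in a graph and data feeds as edges; each connected cluster is then a candidate for a single consistency group. The Python sketch below is an illustration under that assumption, not a description of how any particular replication or CDP product defines its groups. The function name and the component names are hypothetical.

```python
from collections import defaultdict


def consistency_groups(components, data_flows):
    """Group components that exchange data, directly or indirectly.

    components: iterable of component names
    data_flows: iterable of (source, target) pairs meaning "source feeds target"
    Returns a list of groups, each a sorted list of component names.
    """
    # Direction does not matter for consistency: if A feeds B, both must be
    # protected at the same point in time, so build an undirected graph.
    neighbors = defaultdict(set)
    for source, target in data_flows:
        neighbors[source].add(target)
        neighbors[target].add(source)

    groups = []
    seen = set()
    for component in components:
        if component in seen:
            continue
        # Depth-first walk collects everything reachable from this component.
        group = set()
        stack = [component]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            group.add(current)
            stack.extend(neighbors[current] - seen)
        groups.append(sorted(group))
    return groups


# Hypothetical inventory: a mainframe application feeding Unix and Windows tiers.
components = ["mainframe-billing", "unix-reporting", "windows-portal", "hr-database"]
data_flows = [
    ("mainframe-billing", "unix-reporting"),
    ("unix-reporting", "windows-portal"),
]

for group in consistency_groups(components, data_flows):
    print(group)
```

In this example the three linked tiers land in one group and the standalone database in another, which suggests the former need a coordinated protection point while the latter can be protected on its own schedule.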
No testing, no DR
The planning part of disaster recovery is relatively easy compared to testing the plan. Testing the DR plan is often dreaded and, unfortunately, often avoided. Yet without proper testing, one might as well not bother with the planning, because an untested plan is unlikely to be executed successfully.
Some fundamental considerations for testing include:
- Test application recovery, not just data recovery (think application interdependency)
- Let nonprimary individuals perform the recovery to validate procedures and documentation
- Construct multiple disaster scenarios and employ role-playing
- Establish a positive disaster recovery testing mindset: uncovering (and fixing) problems is a good thing
- Track metrics to measure and chart improvement
The most common reason given for not doing more extensive testing is cost. This will inevitably be a point of contention because DR testing is viewed as an exception to day-to-day operations. The only way to address this issue effectively and justify the cost is to link the testing process closely to RTO/RPO service-level objectives. This means the disaster recovery business case, particularly the financial impact of RTO/RPO, must be accurate and complete. The message should be that comprehensive testing is essential to ensuring that those metrics can actually be met, and that it is an integral part of the recovery process.
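As a rough illustration of linking test results to RTO/RPO objectives and tracking them over time, the Python sketch below compares what a test actually measured with the agreed targets. The class names, field names and sample values are all hypothetical; the only idea taken from the discussion above is that each test should be scored against the objectives the business signed off on.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Objective:
    """Agreed service-level targets for one application (hypothetical structure)."""
    rto: timedelta   # maximum acceptable time to resume operations
    rpo: timedelta   # maximum acceptable window of lost data


@dataclass
class DrTestResult:
    """What a single DR test measured for one application."""
    application: str
    time_to_resume: timedelta     # measured from declaration to application usable
    data_loss_window: timedelta   # age of the most recent recoverable data


def evaluate(result, objective):
    """Score one test against its objectives and report the remaining margin."""
    return {
        "application": result.application,
        "rto_met": result.time_to_resume <= objective.rto,
        "rpo_met": result.data_loss_window <= objective.rpo,
        # A negative margin shows by how much the objective was missed.
        "rto_margin": objective.rto - result.time_to_resume,
        "rpo_margin": objective.rpo - result.data_loss_window,
    }


# Illustrative values only.
objective = Objective(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
result = DrTestResult(
    application="order-entry",
    time_to_resume=timedelta(hours=5, minutes=30),
    data_loss_window=timedelta(minutes=10),
)

report = evaluate(result, objective)
print(report["rto_met"], report["rpo_met"])
```

Recording the margins from each test, rather than only pass or fail, gives a trend that can be charted from one test to the next and tied back to the financial case for the objectives.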
This article originally appeared in Storage magazine.
James Damoulakis is CTO of GlassHouse Technologies, an independent storage services firm with offices across the United States and in the UK.