Back to DR planning basics
There haven't been many recent disasters, but that doesn't mean you can forget disaster recovery practices.
In the wake of Hurricanes Katrina, Wilma and company, there's been a sharp rise in the overall awareness of disaster recovery (DR) and the desire to improve DR capabilities. But the 2006 hurricane season represented something of a reversal from last year, and without the same high-profile disaster coverage, some companies may have been lulled into complacency. So it seems like a good time to reconsider some DR basics and discuss a lingering concern: the business-IT expectations gap.
I've written about business and IT alignment issues, a topic that has become standard fare in CIO-targeted publications. In the storage arena, we're seeing progress among forward-thinking companies toward better alignment of primary data and the beginnings of true business-driven, data-retention policies. However, DR is still one area where alignment often falls significantly short.
In an ideal DR scenario, IT personnel would be prepared to handle multiple disaster categories, with their actions dictated by clearly defined and documented policies and procedures. Application recovery would be prioritized on the basis of service levels driven by well-thought-out recovery time objective (RTO) and recovery point objective (RPO) metrics. Much of the recovery would be automated through replication and other recovery tools.
Unfortunately, that scenario is more the exception than the rule for several reasons. First, like every other area within IT, DR is a balancing act; there's a trade-off between what we want to do and what we can afford. And by its very nature, DR is one of the biggest challenges IT faces. You must perform activities that are seldom done and accomplish them under the worst possible circumstances. Toss in the inherent limitations on planning and testing, and it's easy to understand why many IT professionals cringe at the mere mention of DR.
One of the fundamental prerequisites of successful DR planning is to understand the real requirements. What does the business need, and can IT meet that need in terms of both capability and cost? As already noted, the key performance metrics to support this are RTO and RPO. Briefly, RTO is the maximum acceptable time to resume operations--not just to recover the data--and RPO is the maximum acceptable amount of data loss, usually expressed as a window of time.
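To make the two metrics concrete, here is a minimal sketch that checks a single recovery against its targets. Everything in it is an illustrative assumption rather than part of any DR product: the names RecoveryObjectives and evaluate_recovery are hypothetical, and the four-hour RTO, 15-minute RPO and timestamps are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    """Hypothetical container for the targets agreed with the business."""
    rto: timedelta  # maximum acceptable time to resume operations
    rpo: timedelta  # maximum acceptable window of lost data


def evaluate_recovery(objectives, last_recovery_point, outage_start, operations_resumed):
    """Compare what a recovery actually achieved against the RTO and RPO targets."""
    downtime = operations_resumed - outage_start           # measured against RTO
    data_loss_window = outage_start - last_recovery_point  # measured against RPO
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Invented scenario: a nightly backup taken at 1 a.m., an outage at 9 a.m.,
# and the application back in service at noon.
result = evaluate_recovery(
    RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15)),
    last_recovery_point=datetime(2006, 11, 1, 1, 0),
    outage_start=datetime(2006, 11, 1, 9, 0),
    operations_resumed=datetime(2006, 11, 1, 12, 0),
)
print(result["rto_met"], result["rpo_met"])
```

In this made-up case the three-hour outage fits inside the RTO, but eight hours of transactions are gone, far beyond the RPO. That illustrates why the two metrics have to be agreed on separately.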
The failure to understand and agree upon these metrics for critical applications, and the subsequent inability to invest in and develop capabilities to support them, is the basis for the DR gap between business and IT. Bridging this gap requires IT to meet with business and application owners to understand recovery needs so that the financial impact of outages can be quantified and then weighed against the cost of providing the necessary service level. This may require some negotiation, but the unavoidable truth is that without this conversation, DR success is impossible.
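The weighing of outage impact against protection cost can be sketched as simple arithmetic. This is only a rough model under stated assumptions: the function names are hypothetical, the dollar figures, recovery times and outage frequency are invented, and a real business-impact analysis would consider far more than a single hourly cost.

```python
def expected_annual_outage_loss(cost_per_hour, recovery_hours, outages_per_year):
    """Rough expected yearly loss: hourly impact x hours to recover x outage frequency."""
    return cost_per_hour * recovery_hours * outages_per_year


def lowest_total_cost_option(options, cost_per_hour, outages_per_year):
    """Pick the option whose annual cost plus expected outage loss is smallest.

    options: list of (name, annual_cost, recovery_hours) tuples (all hypothetical).
    """
    def total_cost(option):
        _, annual_cost, recovery_hours = option
        loss = expected_annual_outage_loss(cost_per_hour, recovery_hours, outages_per_year)
        return annual_cost + loss

    return min(options, key=total_cost)


# Invented figures for one critical application.
options = [
    ("tape restore", 20_000, 72),
    ("host-based replication", 90_000, 8),
    ("array-based replication", 400_000, 1),
]
best = lowest_total_cost_option(options, cost_per_hour=10_000, outages_per_year=0.2)
print(best[0])
```

With these assumed numbers the middle option comes out ahead: tape is cheap to run but expensive to recover from, while array-level replication costs more than the outages it would prevent. Different inputs will give a different answer, which is why the figures have to come from the business.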
Building this capability goes well beyond a technology exercise. It consists of planning, identifying dependencies, developing processes and, above all, testing.
If you fail to plan, you plan to fail
A DR plan represents an organization's detailed roadmap of where to go, what to do and when to do it in the event of a disaster. It should incorporate actions that need to be performed before, during and after a disaster is declared. Among the more basic elements are defining the criteria under which a disaster is declared, who can declare it and how individuals are notified. The Gulf hurricane experiences reinforced the challenge and importance of communications, and a good plan should include contingencies; you can't assume e-mail, VoIP or even cell phone service is available.
We know that processes and procedures need to be documented, but we also know that most people hate to do it. Even the most carefully crafted DR plan will become useless if it isn't kept current. DR needs to be baked into the standard change management process so that whenever systems are modified, software is patched or additional storage is assigned, the impact on DR is reviewed and the plan revised accordingly. Likewise, when reorganizations occur, the DR plan must be revisited.
It's clear that double-digit data growth rates dramatically impact the ability to recover within targeted time constraints, but application complexity and interdependence are often-overlooked factors that have a major impact on recoverability. Today, major applications are spread across multiple servers and architectures. It's not uncommon for a mainframe application to feed other applications or subcomponents that reside on Unix or Windows platforms. With the traditional server-centric recovery perspective, it's possible to successfully back up or snapshot each application component but still be unable to fully recover the application because the components were captured at different points in time and are inconsistent with one another.
This situation can be avoided by first understanding the interdependencies among applications and then applying the appropriate data protection approach. The method could be the use of split mirror/replication technology featuring consistency groups that encompass the interdependent elements, or it might be continuous data protection (CDP) technology that can ensure highly granular, synchronized time-based rollback.
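As a minimal sketch of the first step, understanding the interdependencies, the code below groups components that feed one another so that each group can be protected and recovered as a unit. The component names and the consistency_groups function are hypothetical. It only identifies candidate groupings and says nothing about how a particular replication or CDP product defines its consistency groups.

```python
from collections import defaultdict


def consistency_groups(feeds):
    """Group components that are linked, directly or indirectly, by data feeds.

    feeds: iterable of (producer, consumer) pairs.
    Returns a list of sorted component lists, one per interdependent group.
    """
    neighbors = defaultdict(set)
    for producer, consumer in feeds:
        neighbors[producer].add(consumer)
        neighbors[consumer].add(producer)

    seen = set()
    groups = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Walk outward from this component to collect everything connected to it.
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Invented example: a mainframe application feeding Unix and Windows components.
feeds = [
    ("mainframe-orders", "unix-billing-db"),
    ("unix-billing-db", "windows-reporting"),
    ("hr-app", "hr-db"),
]
groups = consistency_groups(feeds)
print(groups)
```

Here the three order-processing components land in one group even though they run on three platforms, which is exactly the case a server-by-server backup schedule tends to miss.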
No testing, no DR
The planning part of DR is relatively easy compared to testing the plan. Testing is the part of DR that's often dreaded and, unfortunately, too often avoided. Yet without proper testing, one might as well not bother with the planning because the likelihood of successful execution is small.
Some fundamental considerations for testing include:
- Test application recovery, not just data recovery (think application interdependency)
- Let nonprimary individuals perform the recovery to validate procedures and documentation
- Construct multiple disaster scenarios and employ role-playing
- Establish a positive DR mindset: uncovering (and fixing) problems is a good thing
- Track metrics to measure and chart improvement
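The last item on the list can start out very simply. The sketch below assumes test results are recorded as one record per test; the DrTestResult fields, the recovery_trend function and all of the sample values are hypothetical and stand in for whatever an organization actually measures.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DrTestResult:
    """Hypothetical record of one DR test for one application."""
    test_date: date
    application: str
    recovery_hours: float  # time to full application recovery
    rto_hours: float       # agreed target at the time of the test
    issues_found: int


def recovery_trend(results, application):
    """Return one application's test history in date order, with change between tests."""
    history = sorted(
        (r for r in results if r.application == application),
        key=lambda r: r.test_date,
    )
    trend = []
    previous = None
    for r in history:
        change = None if previous is None else r.recovery_hours - previous.recovery_hours
        trend.append({
            "test_date": r.test_date,
            "recovery_hours": r.recovery_hours,
            "rto_met": r.recovery_hours <= r.rto_hours,
            "change_hours": change,  # negative means recovery got faster
            "issues_found": r.issues_found,
        })
        previous = r
    return trend


# Invented results, deliberately out of order.
results = [
    DrTestResult(date(2006, 9, 1), "order-entry", 20.0, 24.0, 2),
    DrTestResult(date(2006, 3, 1), "order-entry", 30.0, 24.0, 7),
    DrTestResult(date(2006, 6, 1), "payroll", 10.0, 48.0, 1),
    DrTestResult(date(2006, 6, 1), "order-entry", 26.0, 24.0, 4),
]
trend = recovery_trend(results, "order-entry")
for row in trend:
    print(row["test_date"], row["recovery_hours"], row["rto_met"], row["change_hours"])
```

Even a table this basic supports the positive mindset described above: the first two tests miss the target, but the trend shows problems being found and fixed.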
Expanding technology choices
It's easy to become disheartened when you consider the myriad challenges associated with building an effective DR strategy. But there's some good news: The number of available DR technology options is expanding while costs, though still considerable, are coming within the financial reach of more organizations. Rather than having to depend solely on backup tapes--and their associated multiday RTOs/RPOs--for DR, budget-constrained companies now have a variety of alternatives.
Replication services, for example, can be implemented at the application, database, host, network and storage array levels in ways that should address almost any combination of requirements and budget. For an environment with only a small number of critical applications requiring a short RTO, a host-based tool may do the job; in addition, many environments are successfully replicating databases using the log-shipping facilities native to many databases. Options like these are often overlooked in our storage-centric view. For environments needing to provide recovery services across a broader range of platforms and applications, storage-level replication services make sense; we're now seeing more robust replication, mirroring and snapshotting options offered for midrange arrays.
CDP technologies are also beginning to make their way into environments seeking to improve DR processes. Often introduced to protect a specific application or database, such as Exchange or Oracle, these tools can provide significant benefits, especially with regard to ease of recovery.
The challenge is to not put the cart before the horse. Be aware that tool selection is the easiest (and--be honest--the most fun) part of the DR process, so there's a temptation to jump to this phase first. That would be a serious mistake. Without performing the heavy lifting first--the business-impact analysis and planning--the likelihood of overspending or coming up with an incomplete solution is high.