In a recent conversation I had regarding disaster recovery (DR) planning, a CIO remarked that he'd like to achieve what he called a "provable" disaster recovery strategy. But achieving disaster recovery "provability," or at least greater predictability, remains a challenge. Fundamentally, disaster recovery has a number of moving parts. It's fairly easy to deal with one component of disaster recovery and get it to perform reasonably well. The hard part is coordinating and synchronizing the various elements so they function together. Here are eight tips to help you establish a more reliable disaster recovery strategy:
- Clearly define organizational responsibilities. Roles and responsibilities are a major area where organizations fall short with regard to disaster recovery. The DR process is much more than restoring or replicating data; it's about ensuring that the applications and systems the data supports can be returned to functional business usage. Accomplishing this requires participation from groups outside of IT, including corporate governance and oversight groups, finance groups and the business units impacted.
- Validate the business impact analysis (BIA) process. Technically, the BIA isn't part of the disaster recovery process -- it's a prerequisite that forms the foundation of DR planning. In a perfect world, the output of a business impact analysis would define the kinds of recovery capabilities IT must design and deliver in support of the business. The real world, unfortunately, isn't so simple. Information is often incomplete, and we need to make assumptions to fill in the gaps.
- Define and tier application recovery services. When business executives hear IT people talking about disaster recovery strategy, they're thinking cost. DR is a form of insurance, and because no one wants to spend too much on insurance, efficiency is vital. While there are significant fixed costs inherent to DR -- a recovery site, for example -- there are also a substantial number of variable costs that can be controlled. The key is to realize that not every application requires a two-hour recovery time. One way to contain costs is to establish a catalog of services, based on business impact analysis requirements, that provides several levels of recovery, and then to align each application with the appropriate level. With multilevel recovery services, applications can be prioritized according to importance. Among the business attributes that should be defined within the service catalog are risk (usually expressed in terms of recovery time objective and recovery point objective), quality of service (including performance and consistency levels) and cost.
- Implement a comprehensive cost model. While the business impact analysis determines the impact of downtime to a line of business, and tiered recovery services provide a catalog of services that align with business requirements, there also needs to be a method to determine and allocate the cost of those services. Corporate governance may help set thresholds for recovery and imply minimum levels of protection, but the service level is greatly influenced by cost. The cost model should calculate the per-unit total cost of ownership that would be charged to the business for any given service offering. Among the items included in such a cost model are personnel, facilities, hardware and software, maintenance and support. Having this data available helps significantly in aligning "want" with "need," and is a critical success factor in delivering these services efficiently.
- Design an effective disaster recovery infrastructure. The disaster recovery infrastructure must support the business impact analysis requirements and service-level targets. While disaster recovery is an extension of operational recovery capability, factors such as distance and bandwidth also come into play. The good news is that the number of remote recovery options available to architects and designers has increased dramatically over the past few years. Traditional storage mirroring and replication are more broadly available on a wide range of systems, and compression and deduplication technologies can reduce bandwidth requirements. In addition, technologies like server virtualization can dramatically improve remote recoverability.
- Select the right target recovery site. Disaster recovery site selection often presents a challenge. Organizations with multiple data centers can develop cross-site recovery capabilities; if you don't have that option, selecting a DR site can easily become the biggest challenge in getting disaster recovery off the ground. Key concerns include the levels of protection needed, and whether to own or outsource disaster recovery (and to what degree). The two chief, and often competing, factors to consider are risk and convenience. Planning for protection against a regional disaster means that many DR sites get pushed far away from headquarters, where most of the IT staff is housed. Service recovery levels will determine whether the site is a "hot," "warm" or "cold" site. This is a critical designation because there's a substantial difference in the fixed cost of each. Generally, recovery time objectives (RTOs) of less than a day require a hot site. The question of outsourcing depends on the desired degree of control, guarantees of infrastructure availability at a given location and, of course, cost.
- Establish mature operational disciplines. Some people point out that one of the best ways to improve disaster recovery is to improve production. Put another way, if normal day-to-day operations don't tend to function well, neither will your disaster recovery plan. Therefore, operational discipline is an essential element of predictable DR. The first sign of a potential operational deficiency is the lack of documentation for key processes. Given that disaster recovery, by definition, occurs under seriously sub-optimal conditions, the need for well-documented standard operating procedures is clear. Organizations that have established and actively embraced standard frameworks, like the Information Technology Infrastructure Library (ITIL), are significantly improving their odds of recoverability in the chaotic atmosphere of a disaster situation.
- Develop a realistic testing methodology. Given the operational disruption, practical difficulties and costs involved, we tend to focus our testing on those components that are easy to test. But realistic testing is just that -- testing real business function recovery. While it's necessary to perform component testing on a regular basis, it's equally important to test the recoverability of large-scale functions to ensure that interoperability and interdependency issues are consistently addressed. The closer to a real production environment a test can get, the more "provable" the DR capability.
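To make the tiered service catalog and the per-unit cost model from the tips above more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the tier names, the RTO and RPO values, the dollar figures, and the choice of protected terabytes as the unit of cost are all hypothetical, and the names `RecoveryTier`, `CATALOG` and `cheapest_tier` are invented for this example. Only the cost categories (personnel, facilities, hardware and software, maintenance and support) come from the cost model described above.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTier:
    """One entry in a hypothetical recovery service catalog."""
    name: str
    rto_hours: float      # recovery time objective
    rpo_hours: float      # recovery point objective
    annual_costs: dict    # annual cost per category, in dollars (made-up figures)
    protected_tb: float   # capacity protected by this tier (assumed unit of cost)

    def cost_per_tb(self) -> float:
        """Per-unit total cost of ownership that would be charged to the business."""
        return sum(self.annual_costs.values()) / self.protected_tb


# Illustrative catalog only; every number here is an assumption.
CATALOG = [
    RecoveryTier("tier-1", rto_hours=2, rpo_hours=0.25, protected_tb=50,
                 annual_costs={"personnel": 300_000, "facilities": 250_000,
                               "hardware_software": 250_000,
                               "maintenance_support": 100_000}),
    RecoveryTier("tier-2", rto_hours=24, rpo_hours=4, protected_tb=100,
                 annual_costs={"personnel": 150_000, "facilities": 100_000,
                               "hardware_software": 100_000,
                               "maintenance_support": 50_000}),
    RecoveryTier("tier-3", rto_hours=72, rpo_hours=24, protected_tb=150,
                 annual_costs={"personnel": 60_000, "facilities": 40_000,
                               "hardware_software": 30_000,
                               "maintenance_support": 20_000}),
]


def cheapest_tier(required_rto: float, required_rpo: float) -> RecoveryTier:
    """Return the lowest-cost tier whose RTO and RPO meet the BIA requirements."""
    candidates = [t for t in CATALOG
                  if t.rto_hours <= required_rto and t.rpo_hours <= required_rpo]
    if not candidates:
        raise ValueError("no tier in the catalog meets the stated requirements")
    return min(candidates, key=RecoveryTier.cost_per_tb)


if __name__ == "__main__":
    # An application whose BIA calls for recovery within 48 hours and
    # at most 8 hours of data loss.
    tier = cheapest_tier(required_rto=48, required_rpo=8)
    print(tier.name, tier.cost_per_tb())
```

Putting the per-unit cost next to each tier's recovery objectives is what lets a line of business compare "want" with "need": with these made-up figures, an application that can tolerate a 48-hour recovery lands in the second tier rather than paying for two-hour recovery.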
The elements outlined here transcend the boundaries of the IT infrastructure. It's therefore critical for IT administrators to have a strong understanding of the problems at hand and to learn how to address them so they can influence strategic disaster recovery decision-making wherever possible. This will help them avoid being placed in the Catch-22 situation of having to solve a problem over which they have no control.
This article originally appeared in Storage magazine.
James Damoulakis is CTO of GlassHouse Technologies, an independent storage services firm with offices across the United States and in the UK.