At a time when businesses depend more than ever on automation to enable fewer people to be more productive, disaster recovery (DR) planning is critical. Even a minor disruption can have major consequences for business continuity.
That said, many organizations have scaled back their planning efforts in response to economic pressures. One significant area of cutbacks is the funding of plan testing.
Testing is the long-tail cost of disaster recovery planning. Once the critical assets of the company have been identified and objectives have been set for their timely recovery following an interruption event, planners typically go on a hunt for the right recovery technique to safeguard and restore the assets within the limitations imposed by technology availability and budget. Once planners have settled upon the techniques that fit the need, the plan is documented and bound, and we get down to the real work of disaster recovery: routine testing.
Traditional testing follows a pattern. Annually, a test plan is prepared and presented to management so that resources can be pre-allocated for the coming series of test activities. Then, discrete plans are made for each of the testing events scheduled for the next 12 months. Each test event is executed per schedule, with recovery tasks and procedures played out in a nonlinear fashion to optimize test time. (To avoid wasting precious test time, interdependent tasks are tested separately -- spread over different test events -- to keep the failure of one test procedure from impacting the entire event.) Finally, test results are scrutinized to determine what changes need to be made to the overall plan design and what tasks need to be scheduled for re-testing.
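The idea of spreading interdependent tasks over different test events can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the task names, the dependency pairs, and the `spread_tasks` helper are hypothetical, and the greedy assignment shown is one simple way to keep interdependent tasks apart, not a description of how any particular planning team builds its schedule.

```python
from collections import defaultdict


def spread_tasks(tasks, interdependent_pairs):
    """Assign each task to a numbered test event so that no two
    interdependent tasks share an event (simple greedy assignment)."""
    neighbours = defaultdict(set)
    for a, b in interdependent_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    event_of = {}
    for task in tasks:
        # Events already used by tasks this one depends on (or vice versa).
        taken = {event_of[n] for n in neighbours[task] if n in event_of}
        event = 1
        while event in taken:
            event += 1
        event_of[task] = event
    return event_of


# Hypothetical recovery tasks and interdependencies, for illustration only.
tasks = [
    "restore database",
    "restart application",
    "reroute network",
    "notify staff",
]
pairs = [
    ("restore database", "restart application"),
    ("restart application", "reroute network"),
]

schedule = spread_tasks(tasks, pairs)
for task, event in schedule.items():
    print(f"test event {event}: {task}")
```

With this arrangement, a failed database restore in one event does not block the application restart scheduled for another, which is the point of testing the tasks out of sequence.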
This strategy for testing has typified DR for the past 30 years. It worked well in a mainframe environment, but proved to be quite a bit less efficient in the distributed computing world. Part of the explanation was the IT discipline, standardization, and logistical simplicity of mainframe operations and the mainframe technology platform, none of which is manifested in the Wild West of open systems. In open systems, we confront more complexity: platform configurations vary, incompatibilities between hardware create recovery platform design issues, application software is not designed for recovery, and so on.
Compounding the problem is the proliferation of recovery "solutions." Everybody wants in on the data protection act. Storage hardware vendors want to sell multi-hop mirroring or continuous data protection that works only between products bearing their corporate logo. Third-party data backup or high-availability software products have entered the market in droves. Virtualization vendors want to push the use of their cluster failover functionality. Cloud vendors want to sell data protection in the network. And application and operating system vendors now sport their own proprietary data protection schemes. Few companies today use only one of these techniques; usually, planners confront a mix of different data protection processes.
Corralling all of these different processes into a manageable whole is a significant challenge. Figuring out a coherent way to test them can be a nightmare.
This points to a significant deficit in contemporary planning: the failure to develop DR strategies with testing in mind. Planners tend to focus on how they will protect and recover assets, at what cost, and in what amount of time. Rarely if ever do we ask the question of how we will test the strategy.
"Testing as an afterthought" contributes significant cost to the maintenance of a plan. It means that every test event that is scheduled must cover three buckets of tasks: data management, infrastructure replacement, and logistical resupply. In other words, the test must validate 1.) that the right data is being protected and is available in some form for recovery; 2.) that the right infrastructure exists (or can be built rapidly) to re-host applications and data assets; and 3.) that the right people and processes have been identified for recovery and that they all know their jobs.
The truth is that if we paid proper attention to how continuity procedures will be tested at the time we select data protection and recovery techniques, we could eliminate the need to test data management and infrastructure replacement altogether as part of an annual test regime. This would amount to building in recovery testing, rather than bolting it on.
Deploying a "wrapper solution" (some call these software packages geo-clustering or site recovery management solutions) could provide a partial answer. Products like CA XOsoft, NeverFail Group NeverFail, and others enable you to monitor data protection processes on an ongoing basis. They can alert you when the state of protected assets changes -- data moves to a new physical location or exceeds the time-to-data parameters that have been established in recovery objectives. They can confirm that backups or mirrors are happening and are complete. With some products, you can actually simulate or perform live failovers to alternate platforms on a daily basis if you want to confirm that the shadow infrastructure is still up to hosting the workload you will place on it.
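To make the monitoring idea concrete, here is a minimal Python sketch of the kind of checks described above: whether data has moved, whether the time-to-data objective is exceeded, and whether the last backup or mirror completed. It does not use or represent the interface of any product named here. The `ProtectedAsset` fields, the `check_asset` function, and the sample assets are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ProtectedAsset:
    # All fields are hypothetical; a real product defines its own model.
    name: str
    expected_location: str
    current_location: str
    estimated_time_to_data: timedelta
    time_to_data_objective: timedelta
    last_copy_complete: bool


def check_asset(asset):
    """Return a list of alert messages for one protected asset."""
    alerts = []
    if asset.current_location != asset.expected_location:
        alerts.append(f"{asset.name}: data moved to {asset.current_location}")
    if asset.estimated_time_to_data > asset.time_to_data_objective:
        alerts.append(f"{asset.name}: time-to-data exceeds recovery objective")
    if not asset.last_copy_complete:
        alerts.append(f"{asset.name}: last backup or mirror did not complete")
    return alerts


# Sample assets, invented for this example.
assets = [
    ProtectedAsset("orders-db", "array-a", "array-a",
                   timedelta(hours=1), timedelta(hours=4), True),
    ProtectedAsset("mail-store", "array-a", "array-b",
                   timedelta(hours=6), timedelta(hours=4), True),
]

report = {asset.name: check_asset(asset) for asset in assets}
for name, alerts in report.items():
    print(name, "OK" if not alerts else alerts)
```

Run continuously rather than once a year, checks of this sort are what would let data management validation drop out of the scheduled test events.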
While wrappers are imperfect solutions, their judicious use can reduce the testing workload and enable DR planners to use test time much more efficiently -- and cost-effectively. These days, when funding is a challenge in most businesses, we all need to do what we can to reduce testing costs.
About this author: Jon Toigo is a veteran planner who has assisted more than 100 companies in the development and testing of their continuity plans. He is the author of 15 books, including five on the subject of business continuity planning, and is presently writing the 4th Edition of his "Disaster Recovery Planning" title for distribution via the Web. Readers can read chapters as they are completed, and contribute their own thoughts and stories, at http://book.drplanning.org commencing in August 2009.
This was first published in July 2009