Disaster recovery plan testing primer: Test to fail

Disaster recovery testing is highly valued among standards and DR/BC organizations, but these tests are only effective if you perform them correctly.

According to many standards institutions and organizations that focus on disaster recovery (DR) and business continuity (BC), disaster recovery plan testing helps ensure that a business can continue to operate, even in times of disaster.

For example:

An organization's business continuity and incident management arrangements cannot be considered reliable until exercised and unless their currency is maintained. -- BS 25999 (British Standards Institution [BSI])

Business continuity plans should be tested and updated regularly to ensure that they are up to date and effective. -- ISO 27002 (International Organization for Standardization)

The entity shall evaluate program plans, procedures, and capabilities through periodic reviews, testing, and exercises. -- NFPA 1600 (Standard for Disaster/Emergency Management and Business Continuity)

So if everyone agrees that testing of business continuity/disaster recovery plans is a genuine, certified good thing, then there's nothing to argue about here, right? I, however, have reason to disagree with the claimed success of disaster recovery testing. I've seen too many examples of DR plans that have been tested routinely over extended periods of time, but still fail when needed.

Some of the problems with the assurance that DR testing provides have to do with definitions. A quick glance at the statements extracted from the best-known business continuity management standards shows that the words test, exercise, review and rehearse are used in an overlapping manner, if not interchangeably. Some definitions distinguish "testing equipment" from "exercising people," but these terms can be confusing. Moreover, many disaster recovery tests are not carried out correctly because of human error: people run equipment, and they often make unexpected mistakes under pressure. And who today can get meaningful work done without the necessary equipment?

Demonstrations, not disaster recovery plan tests

I have seen entirely too many companies sign up at their commercial recovery service for their allotted 48 hours of test time and force a small coterie of specialists through two days of hell so they could return home and announce that everything went well again this year. They were not testing; they were demonstrating. They were showing that a limited team of well-trained individuals can perform tasks very much like their routine jobs at a distant location that has become familiar to them over time.

Now, there is some value to a demonstration. It allows management to reassure regulators that they are doing what is expected of them, and it makes auditors happy. But it does not validate that a set of procedures would be effective if carried out without key personnel, without advance planning and under the pressure of an actual emergency. To use a sports analogy, this sort of "testing" is practice, admittedly a necessity for success at game time. But it is not at all the same thing as playing for keeps.

Finding defects in disaster recovery plan testing

Successful tests do not prove that a disaster recovery plan will succeed, but failed tests do prove that the plan will fail. And that is what makes testing so important.

Business continuity plans and disaster recovery plans are engineered products constructed by fallible human beings. Like all engineered products, they have defects, many of which go unnoticed for a very long time until a certain set of circumstances aligns to expose the flaw. If a disaster recovery plan is going to fail, it will most likely fail during a disaster. Therefore, if a test detects a defect under relatively ideal conditions, it enables enhancements to be made before the plan is ever needed.

A disaster recovery plan is never a finished document, and it is probably inaccurate because of the constant erosion caused by changes to the business, technology, personnel, etc. Maintenance of a DR plan is necessary but sometimes insufficient if flaws exist in the original plan. Because of that, maintenance activities themselves need to be tested to find defects introduced by the fixes, and that cycle can go on continuously. Some recovery processes are incredibly complex, such as those for an ERP system, a non-standard file system or a multi-site integrated application. Changes made to repair a flaw in one of these processes are likely to introduce others.

Independence in testing

Tests are conducted routinely, but often by only one person, for whom disaster recovery testing has over time become part of the job description. Because this person creates the DR tests, often only he or she understands the mental shorthand that is written into the plan. And by making the plan easy to carry out alone, he or she has introduced the very source of failure: when the plan is needed, there is no assurance that this person will still be employed, will not be on vacation and will not have been injured in the event that caused the plan to be needed.

To make sure your DR testing does its job, be sure to take these items into consideration:

  • When a disaster recovery plan is newly created, it is legitimate to demonstrate it. There will be enough kinks to iron out that there is no need to complicate the testing process further. But thereafter, develop test scenarios that are intended to simulate the chaotic reality of a disaster (e.g., a key person is not available, a vital backup tape cannot be read, a software patch has not been applied to the recovery version of the operating system, etc.).
  • Have someone other than those who are conducting the test construct the scenario. If you know where the punches are coming from, it is easier to duck. It is just human nature to make the test easier to pass by formulating an easily soluble case.
  • An independent person or group should referee every disaster recovery plan test. It is easy to declare victory when the testers are the only ones present, but much more difficult if there is a gimlet-eyed auditor present. However, the observant eyes don't necessarily need to belong to auditors; anyone independent will do, such as consultants, vendors or technical personnel from other divisions on a mutual basis.
  • To the degree that testing indicates something other than total success, any shortcomings noted should be considered defects in the disaster recovery program as well as in its resulting plans. Once defects are recognized and categorized, determine their causes and resolve them. Implement preventive and detective controls to identify and track defect recurrence and diminution (or growth). All findings should be communicated to management.
  • Resolution of defects must be reflected in the testing that identified them. The same test should be re-performed with the resolutions in place to determine if they are effective in eliminating the defects. This may require several iterations of testing, so waiting a year for the next test is insufficient. Be sure to document the results of the re-testing, as well as to develop and implement testing methods to identify possible defect recurrence.
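
The re-test cycle described in the last two items can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed tool: the Defect record, its fields, the status labels ("open", "resolved", "recurred") and the sample data are all hypothetical. The idea is to log, for every test run, whether each defect was observed, so that a defect that returns after an apparently successful fix is reported as a recurrence.

```python
from dataclasses import dataclass, field


@dataclass
class Defect:
    """One shortcoming noted during a DR test (hypothetical record layout)."""
    defect_id: str
    description: str
    cause: str = "undetermined"
    # One entry per test run: True if the defect was observed in that run.
    history: list[bool] = field(default_factory=list)

    def record(self, observed: bool) -> None:
        """Document the outcome of one test or re-test for this defect."""
        self.history.append(observed)

    @property
    def status(self) -> str:
        if not self.history:
            return "untested"
        if not self.history[-1]:
            return "resolved"  # absent in the most recent run
        # Observed in the latest run; if an earlier run was clean, it came back.
        return "recurred" if False in self.history[:-1] else "open"


def summarize(defects: list[Defect]) -> dict[str, list[str]]:
    """Group defect IDs by status, e.g. for a report to management."""
    report: dict[str, list[str]] = {}
    for d in defects:
        report.setdefault(d.status, []).append(d.defect_id)
    return report


# Illustrative data only: an initial test followed by two re-tests.
tape = Defect("D1", "vital backup tape cannot be read")
patch = Defect("D2", "patch missing on recovery operating system")
for tape_seen, patch_seen in [(True, True), (False, True), (True, False)]:
    tape.record(tape_seen)
    patch.record(patch_seen)

report = summarize([tape, patch])
```

In this made-up run, D1 looks fixed after the first re-test but reappears in the second, which is exactly the case that a once-a-year test would be slow to catch. Keeping the per-run history, rather than a single pass/fail flag, is the design choice that makes recurrence visible.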

Overall, disaster recovery tests are essential, and demonstrations have their place, but be cautious and take the correct steps to test your DR plans. Otherwise, your plan might fail you when a disaster actually strikes.
