
Five disaster recovery test plan mistakes to avoid

Expert Paul Kirvan details five disaster recovery test plan errors to eliminate from your pre-test buildup, and some considerations for during and after testing.

A great deal of work goes into a disaster recovery test plan. Among these activities are:

  • Identifying the test team.
  • Securing the test team's availability on the planned DR test date.
  • Identifying additional subject matter experts such as database administrators and network engineers.
  • Determining what will be tested and the expected outcomes.

During the actual disaster recovery test, the game plan must be ready to use, the testing procedures must be documented and shared with all players, and the systems and network resources must be in place. Despite all this preparation, here are five mistakes that can still occur when planning for and conducting your DR test:

Mistake 1: Not having a disaster recovery test plan

You need to have a documented plan of action that clearly outlines pre-test activities, test procedures, post-test wrap-up activities, test scripts and test team members. This written plan is also important from an audit perspective, as it provides auditors with evidence of how the test was planned and executed.

Mistake 2: Not having all the proper technology representatives present

While you may think you have all the necessary players identified for the DR test plan, double-check to make sure a key subject matter expert, such as the system's developer, has not been overlooked. Ensure you have all the technology elements represented -- network, hardware, application, database, utilities -- and any other special arrangements that are unique to the system being recovered.

Mistake 3: Not having a detailed script of the test activities

Aside from the disaster recovery test plan itself, this is one of the most important technical documents you'll have for the test. Scripts are the playbook used to execute tests, and their accuracy is essential. Even transposing two characters in a line of code can cause a system test to fail. If possible, have a subject matter expert review the test script to identify any possible changes or edits. It might also be useful to have a dry run of the test before going live.
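One way to make a test script reviewable is to express it as structured data rather than prose, so a subject matter expert can check it field by field and a dry run can catch gaps before test day. The sketch below is a hypothetical example; the step names, owners and expected outcomes are illustrative, not from any particular plan.

```python
# Hypothetical DR test script expressed as data, so it can be reviewed
# and sanity-checked in a dry run before the live test.
TEST_SCRIPT = [
    {"step": 1, "owner": "network engineer",
     "action": "Fail over WAN link to DR site",
     "expected": "DR site reachable"},
    {"step": 2, "owner": "DBA",
     "action": "Restore database from latest backup",
     "expected": "Database online, integrity check passes"},
    {"step": 3, "owner": "application team",
     "action": "Start application services",
     "expected": "Login page responds"},
]

def validate_script(script):
    """Dry-run sanity checks: steps numbered consecutively, and every
    step has an owner, an action and an expected outcome."""
    problems = []
    for i, step in enumerate(script, start=1):
        if step.get("step") != i:
            problems.append(f"step {i}: out-of-order or missing number")
        for field in ("owner", "action", "expected"):
            if not step.get(field):
                problems.append(f"step {i}: missing '{field}'")
    return problems

print(validate_script(TEST_SCRIPT))  # an empty list means the script passes review
```

A review like this won't catch a wrong command the way a human expert can, but it does catch the mechanical errors (missing owners, misnumbered steps) that derail a test on the day.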

Mistake 4: Not having the test elements in place and ready to use

If your test is for a certain day, you'll be expecting all the components needed for the DR test to be in place. If one or more of your hardware, software or network elements are not ready, cancel and reschedule the test so you won't disrupt production systems.
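The go/no-go decision described above can be made systematic with a simple pre-test readiness check: every required element must pass, and a single failure means reschedule. This is a minimal sketch with stand-in checks; in practice each callable might ping a host or query a monitoring tool.

```python
# Hypothetical pre-test readiness check: all elements must be ready,
# otherwise the DR test should be rescheduled.

def readiness_check(checks):
    """Run each named check; return (go, failures).

    `checks` maps an element name (e.g. "network") to a zero-argument
    callable returning True if that element is ready.
    """
    failures = [name for name, check in checks.items() if not check()]
    return (len(failures) == 0, failures)

# Stand-in checks for illustration only:
checks = {
    "network": lambda: True,
    "recovery hardware": lambda: True,
    "backup software": lambda: False,   # not ready -- forces a no-go
}
go, failures = readiness_check(checks)
print("GO" if go else f"NO-GO, reschedule: {failures}")
```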

Mistake 5: Not checking to see if others have scheduled tests

In a medium- to large-sized IT department, there will likely be many scheduled activities in place, such as new system acceptance testing, network upgrades and software upgrades. This means your test will need to fit in with the other activities, so there are no disruptions to scheduled work. DR testing can often take several hours, so it should be scheduled as far in advance as possible. You should also send out notifications to other IT leads advising them of your planned test.

Additional disaster recovery testing plan considerations

While the above five mistakes should be avoided before your plan is tested, here are three factors to consider during and after the test.

  1. Not being willing to halt the test and reschedule if things are going poorly. If things are not going as planned, stop the test and review what happened up to the point where the test systems stopped performing as expected. If it's possible to bypass the failed element, resume the test and follow the original script.
  2. Not documenting what happens during the test. This includes recording when steps are completed and identifying modifications to the test script that are developed "on the fly." One of the key players in a disaster recovery test is a scribe -- the person taking notes of what is happening. Another is a timekeeper, who notes the specific start and end times of test activities.
  3. Not preparing an after-action report summarizing the test, lessons learned and so on. The scribe's and timekeeper's notes will be used to prepare an after-action report on the disaster recovery test. Auditors will want to review how the test was conducted, actions of the specific test participants, how specific issues were handled, any infrastructure elements that didn't work and the results achieved.
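The timekeeper's notes described above translate naturally into a simple log of start and end times per step, which can then be summarized for the after-action report. This is a sketch only; the step names and times are made up for illustration.

```python
# Sketch of a timekeeper's log: record start/end times per step, then
# summarize durations for the after-action report. Data is illustrative.
from datetime import datetime

def summarize(log):
    """Turn (step, start, end) entries into AAR lines with durations."""
    lines = []
    for step, start, end in log:
        minutes = (end - start).total_seconds() / 60
        lines.append(f"{step}: {start:%H:%M}-{end:%H:%M} ({minutes:.0f} min)")
    return lines

log = [
    ("Fail over WAN link", datetime(2024, 3, 2, 9, 0), datetime(2024, 3, 2, 9, 25)),
    ("Restore database", datetime(2024, 3, 2, 9, 25), datetime(2024, 3, 2, 11, 10)),
]
for line in summarize(log):
    print(line)
```

Pairing these timings with the scribe's narrative notes gives the after-action report both the what and the how long, which auditors typically want to see.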

When you have a disaster recovery test plan in place, sufficient preparation and properly scripted actions increase the odds that your system or technology test will succeed.

Next Steps

Ensure your DR plan testing runs efficiently

Lack of funds cited as reason not to have a DR plan

Widely used standards for business continuity and disaster recovery


Join the conversation



What do you consider the most important element of a disaster recovery test plan?
I think the most important thing is to ensure that your plan allows you to quickly and viably address every part of your DR strategy. To do that, you need to perform a test plan review on a regular basis, with the people that will be implementing the plan when the time comes. Just like code review can catch coding errors early on, a test plan review can do the same for planning errors.
Hi, mcorum--I agree 100% and would add to that change management. Even the best DR plan is likely to fail if there isn't a system to keep track of changes in the environment that will affect application recoverability.
Make sure all steps are being done. Don't assume. Have a checklist and check it often. I have been through one bad recovery years ago. It was a mess, partially due to a very poor tape library. We had to piece together our system and re-enter a lot of data manually. It took almost two weeks to get back to a usable state. Still, we may have missed a thing or two. We didn't know.

All, Thanks for putting this together!

Prior to execution, make it crystal clear (assuming you're the lead) that no actions are to occur unless actors have been instructed to do so at a given time and "it's in the plan." Further, you must find an easy method to annotate steps that are to run in parallel, so those teams clearly understand. Dependencies must be known for both before and after steps. All parties should have participated, if not in the planning, then at a minimum in the final "walkthrough" a day or two before the exercise. Chat is one of the best communication methods for numerous reasons, plus you have email, a conference line (video or audio only) and a command center if required. All need to be on board with the protocol to be used. Predetermined thresholds need to be highlighted in the plan with all in agreement: e.g., if we reach noon on Saturday and we have not yet failed over file systems, databases, middleware, etc. to DR, we stop and go back to production.

During execution, you're recording start and stop times for each step, plus you have two other individuals (preferably one business and one application) also recording times, so when you create the AAR (after-action report), there are checks and balances to discuss. Again, as the lead, you must be ready to pull in the right players at the slightest hint of something not sounding right. Many technical experts (SMEs, etc.) have far too much pride to escalate on their own; you need to sense when, then pull the trigger and escalate.

Post-exercise: Send an immediate email summarizing the highlights. Never give RTA information at this time, but do give a sense of whether the test was successful or not. As soon as possible, hold the post-test meeting with all the players: what worked well, what failed miserably, newly learned items, etc. Do we have all the issues identified (open, closed, info-only, etc.)? Collate issues, assign ownership and obtain a remediation plan for each. Establish RTA (recovery time achieved, or as some call it, recovery time actual), and ask whether it was measured against RPO. It should be in more mature programs; if not, you're missing a key ingredient. Leverage SOP tweaks and share them across all the IT teams that may benefit (from the help desk on up). Look for commonalities that can be shared with other similar application recoveries, or with whatever your exercise is testing: a computer room, a data center or a suite of apps pertinent to a key, specific business process.


I do have a question for you: I'm now working in government (new to me), and wonder if anyone may have government examples of recovery tiers. Here (after only three weeks in), I can see they have next to nothing in play today.
There are some pretty good tools available now to monitor/audit DR plans that can determine if there are any gaps (missing replication, orphaned servers, etc.). These can help shorten the time it takes to test and can be used in between formal tests in conjunction with a change management app/process, as I mentioned before.
@Rich, I agree on the aspect you talked about of including change management. Our DR plan involved many different groups from across the organization, including a mix of change management (both application and infrastructure), CMDB, incident management, project management, on and on. It really does take the involvement of the entire IT team to make it work.
Once again you make a really good point, mcorum. Even the most straightforward DR plans usually involve a lot of apps, processes, departments, people, etc.--a lot of moving parts that need to be coordinated. It can certainly be like herding cats sometimes, but good upfront planning can make it much easier to manage.
It's all too easy to overlook an important role when assembling the test team, especially when the systems are highly integrated with other systems. I've found it beneficial to pull everyone together in a room for a dry run before actually trying to implement the test plan. Getting everyone in a room and talking it through provides a good opportunity to identify those roles that were overlooked.