How frequently should you conduct failover testing?

Marc Staimer of Dragon Slayer Consulting says failover testing should be part of data protection, disaster recovery and business continuity processes.

The answer is relative. Failover testing should be part of data protection, disaster recovery and business continuity processes. It is an essential part of data protection that unfortunately gets short shrift. Failover tests, while often unsuccessful, can reveal errors in processes, designs, architectures, decisions, products and assumptions.

Competent professionals expect issues and problems from failover tests. They learn from the inevitable failures, then fix and improve their processes in a low business-risk environment (i.e., the test). Not-so-competent professionals avoid testing or minimize it because they're afraid it will reveal errors in judgment or in promises made to management.

Unfortunately, the absolute worst time to discover failover problems is when a disaster strikes. At that point, there is little that can be done -- except to clean up.

Failover testing, in an ideal world, would be performed frequently. In practice, time, personnel, resources and budgets dictate the frequency. Business continuity best practices recommend quarterly failover testing; however, that schedule may not be practical for many organizations. At a minimum, failover testing should be performed semi-annually.

The first step in failover testing is for management to commit to performing the testing on a regular, periodic schedule. The second step is determining application and data value. That value is commonly expressed through the recovery point objective (RPO) and recovery time objective (RTO). RPO is the amount of data, measured in time, that the organization can tolerate losing. For example, if the data is backed up, snapped or replicated once a day, then the RPO is 24 hours. If the data is protected four times a day, then the RPO is 6 hours. If the data is protected on every write, as with continuous data protection (CDP), the RPO is "zero." RTO defines how much application and data downtime is tolerable.
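The RPO arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name worst_case_rpo is hypothetical, and it assumes protection events are evenly spaced across a 24-hour day, which real backup or replication schedules may not be.

```python
from datetime import timedelta


def worst_case_rpo(protections_per_day: int) -> timedelta:
    """Return the longest gap between protection events in a 24-hour day.

    Assumption: the events (backup, snapshot or replication) are evenly spaced.
    """
    if protections_per_day <= 0:
        raise ValueError("protections_per_day must be a positive integer")
    return timedelta(hours=24) / protections_per_day


if __name__ == "__main__":
    # The two schedules used as examples in the text: once and four times a day.
    for per_day in (1, 4):
        hours = worst_case_rpo(per_day).total_seconds() / 3600
        print(f"{per_day} protection event(s) per day: RPO of {hours:g} hours")
```

CDP does not fit this formula because it protects on every write rather than on a schedule, which is why its RPO is described as "zero."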

The next step is application and data prioritization. This is the process of figuring out which applications -- and their data -- must be brought up and running first. What about the employees? How will they work? How will they access their desktops, applications and data? All of that must be worked out, as well. It does the organization no good if the applications and data are up and running, but no one has access to them.
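One possible way to turn RTO and RPO values into a recovery order is sketched below. It assumes that the application with the tightest RTO comes up first, with RPO as a tie-breaker; that ordering rule is an illustration, not a rule stated here, and it ignores dependencies between applications. The application names and hour values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    rto_hours: float  # tolerable downtime
    rpo_hours: float  # tolerable data loss, measured in time


def recovery_order(apps):
    """Sort applications so the least downtime-tolerant one is recovered first.

    Assumption: tightest RTO first, tightest RPO as the tie-breaker.
    """
    return sorted(apps, key=lambda app: (app.rto_hours, app.rpo_hours))


if __name__ == "__main__":
    # Hypothetical applications and objectives, for illustration only.
    apps = [
        App("payroll", rto_hours=24, rpo_hours=24),
        App("order-entry", rto_hours=1, rpo_hours=0),
        App("email", rto_hours=4, rpo_hours=6),
    ]
    for position, app in enumerate(recovery_order(apps), start=1):
        print(position, app.name)
```

Whatever employees need in order to reach those applications, such as desktops or remote access, can be listed as entries of their own so that access is prioritized along with everything else.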

Once all of the failover processes and procedures have been determined, a written plan is mandatory. It must clearly specify, in detail, the steps required to handle a failover, the primary individual responsible for each step, a backup individual for each step and the expected timeframe for each step.
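The plan is a written document, but keeping a structured copy of it makes gaps easy to spot. The sketch below is one hypothetical representation: PlanStep and plan_problems are illustrative names, and the checks cover only the four elements listed above (step, primary, backup, timeframe).

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class PlanStep:
    description: str
    primary: str         # primary individual responsible for the step
    backup: str          # backup individual for the step
    expected: timedelta  # expected timeframe for the step


def plan_problems(steps):
    """Return human-readable problems found in a list of PlanStep entries."""
    problems = []
    for number, step in enumerate(steps, start=1):
        if not step.primary.strip():
            problems.append(f"step {number}: no primary individual")
        if not step.backup.strip():
            problems.append(f"step {number}: no backup individual")
        elif step.backup == step.primary:
            problems.append(f"step {number}: backup is the same person as primary")
        if step.expected <= timedelta(0):
            problems.append(f"step {number}: no expected timeframe")
    return problems


if __name__ == "__main__":
    # Hypothetical steps and names, for illustration only.
    plan = [
        PlanStep("Declare failover", "Alice", "Bob", timedelta(minutes=15)),
        PlanStep("Promote replica storage", "Carol", "", timedelta(minutes=30)),
    ]
    for problem in plan_problems(plan):
        print(problem)
```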

Testing requires that a realistic, representative portion of that plan be put through its paces multiple times, simulating multiple contingencies, such as the primary person for a process being incapacitated. Following each test, implement what was learned, correct errors and improve the process. The written plan should also be updated after each failover test.
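A contingency such as an incapacitated primary can be walked through on paper before it is rehearsed for real. The sketch below is a hypothetical helper, assign_owners, that takes plan steps as (description, primary, backup) tuples plus the set of people assumed to be unavailable in a scenario, and reports who would run each step. The step descriptions and names are invented.

```python
def assign_owners(steps, unavailable):
    """Map each step description to the person who would perform it.

    steps: iterable of (description, primary, backup) tuples.
    unavailable: set of people assumed incapacitated in this scenario.
    A value of None marks a step that nobody is left to perform.
    """
    assignments = {}
    for description, primary, backup in steps:
        if primary not in unavailable:
            assignments[description] = primary
        elif backup not in unavailable:
            assignments[description] = backup
        else:
            assignments[description] = None
    return assignments


if __name__ == "__main__":
    # Hypothetical plan and scenario, for illustration only.
    steps = [
        ("Declare failover", "Alice", "Bob"),
        ("Promote replica storage", "Carol", "Alice"),
    ]
    for step, owner in assign_owners(steps, unavailable={"Alice"}).items():
        print(f"{step}: {owner or 'NO ONE AVAILABLE'}")
```

Any step that comes back with no owner is a gap to correct in the written plan after the test.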

One final step must be part of the process. The plan is a living document. New applications, data and systems are added to IT environments all the time, while old ones are retired or taken offline. At least one IT professional (preferably more) must be responsible for ensuring the failover plan is continuously up to date.

It is ultimately all about prioritization, process, preparation, practice and patience. Create a plan, work the plan, test the plan, evaluate the test, modify the plan, and repeat.
