Tools to test your DR plan

Periodically testing a disaster recovery (DR) plan is essential, but it can be a time-consuming and expensive task. New tools that check DR configurations and constantly monitor your site's readiness to recover from a disaster can cut costs and testing time, and provide a level of confidence that your DR plan will actually work when it's needed.

In addition to supporting periodic tests of your DR site, DR testing tools can constantly monitor the site's readiness to recover from a disaster.

Testing the storage portion of your DR plan requires tools to determine whether data was backed up properly. Proper testing may also require an application to constantly monitor the DR site's storage infrastructure--from the number of disks to the configuration of RAID arrays--to ensure it works and matches the storage configuration at your primary site.

The first place to look for confirmation that critical backups and data replications to the DR site took place is the backup and replication software you're using and the reports it generates. In addition, the same tools storage administrators use for day-to-day storage efficiency and management can help assess the health of their backup infrastructures, says John Sing, a senior consultant on business continuity strategy and planning at IBM Corp.

Next, you should test the recoverability of data and apps from the DR site. How often each test is run and how extensive each test should be varies; many firms perform tests of one or several apps, or on selected portions of the backup environment to reduce DR testing costs and avoid the risk of disrupting production apps.

Patrick Honny, departmental information systems manager for the County of San Bernardino Auditor/Controller-Recorder, performs an overall DR test (based on the assumption that the entire primary site has failed) once a year. He also tests the storage portion of his DR plan once a month. "Basically, that's as simple as repointing production servers to the Isilon [Systems Inc.] SANs that are receiving replicated data and pulling up some files," he says.
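A monthly storage-level check like Honny's can be partly scripted: read a sample of files from the replica mount point and compare checksums against the production copies. The sketch below is illustrative only; the paths, the flat-file model and the sampling approach are assumptions, not details of the County's Isilon environment.

```python
# Hypothetical sketch: spot-check that replicated files at a DR mount
# point match their production counterparts. Paths are illustrative.
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def spot_check(primary_root: Path, replica_root: Path,
               samples: list[str]) -> list[str]:
    """Compare a sample of files; return relative paths that mismatch."""
    mismatches = []
    for rel in samples:
        prim, repl = primary_root / rel, replica_root / rel
        if not repl.exists() or file_digest(prim) != file_digest(repl):
            mismatches.append(rel)
    return mismatches
```

A spot check like this catches gross replication failures cheaply; it doesn't replace the full failover test, which also exercises server repointing and application startup.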

Production-level testing
Some users may want to test their DR environment under actual production-level conditions with, for example, the actual number of users an app supports and the level of transactions the app needs to process in specific time intervals. In those cases, automated software can be used to reduce the time, effort and expense required.

IBM's Sing, who uses automated testing tools, says he "can no longer afford to take five, 10 or 15 highly paid individuals off their jobs and dedicate them to two days of testing." Automated test tools ensure tests are done consistently and can be repeated over time, which makes them more useful for auditing purposes.

For example, Compuware Corp.'s Hiperstation line of software "records network traffic heading to and from the mainframe," says Mark Schettenhelm, Hiperstation product manager. It can then be used to access a remote site "and it's as if you have a hundred people beating away at that system," he says.

Writable clones
Any test requires accessing the replicated data, even if only to read or write to a few sample database fields. That access can interrupt the replication process, raising the danger that the replicated data would become out of date if an actual disaster or hardware failure occurred during the DR test.

A number of vendors aim to solve that problem by creating replicas of data that can be written to and kept current. According to NetApp, its FlexClones act as "a transparent writable layer" in front of the snapshot copy of the data in the DR facility, allowing the snapshot to be used for DR testing while remaining up to date in case of an actual disaster. Symantec Corp. has enhanced a similar feature called Fire Drill in release 5.0 of Veritas Storage Foundation for Windows.

CA offers a similar capability with XOsoft Assured Recovery, which suspends the replication of data between a master and replica copy, spooling changes made during a test and applying them to the replica only after the test is over and replication resumes. CA XOsoft Assured Recovery works in conjunction with the CA XOsoft Replication and CA XOsoft High Availability DR and business continuity software (formerly WANSync and WANSyncHA, respectively).
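The idea common to FlexClone-style writable layers and XOsoft's change spooling can be illustrated with a minimal copy-on-write overlay. This is a conceptual sketch, not vendor code: the real products work at the block or replication layer, while this models the behavior with key/value data. Reads fall through to the replica, test-time writes land in an overlay, and discarding the overlay at the end of the test leaves the replica untouched and current.

```python
# Conceptual sketch of a copy-on-write overlay over a read-only replica.
# Not vendor code: FlexClone and Assured Recovery operate at the block
# or replication layer; this models the same idea with key/value data.

class WritableClone:
    """Test writes go to an overlay; the base replica is never modified."""

    def __init__(self, base: dict):
        self.base = base      # read-only replica (still kept current)
        self.overlay = {}     # changes made during the DR test

    def read(self, key):
        # Prefer test-time writes, fall through to the replica.
        return self.overlay.get(key, self.base.get(key))

    def write(self, key, value):
        self.overlay[key] = value   # replica stays pristine

    def discard(self):
        # End of the DR test: drop test writes; replica remains usable
        # for a real failover.
        self.overlay.clear()
```

The design choice that matters is that the base copy is never touched: if a real disaster strikes mid-test, the up-to-date replica is still there to fail over to.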

These writable replicas not only make DR testing safer, but also make the replicated data available for other uses, such as data mining, and test and development.

Click here for a sampling of backup reporting applications (PDF).

Staying in sync
One major challenge in maintaining a DR site is configuring it exactly the same as the primary environment, so that when application servers link to the backup site they can connect to the appropriate storage and continue operation.

"With pressure on businesses to respond more quickly to customer demands, the IT infrastructure supporting the business changes on a daily basis," says Bob Laliberte, an analyst at Enterprise Strategy Group (ESG), Milford, MA. "So any DR environment that was implemented and tested on day one could be at risk on day two. Unfortunately, that risk isn't discovered until a copy of the backup is needed or a DR test is run."

Jeff Pelot, chief technology officer at the Denver Health Hospital and Medical Center, saw how a seemingly minor change in the backup environment can cripple a DR effort when one of his LeftHand Networks Inc. IP SANs failed to take over for another during a monthly DR test. "LeftHand was migrating from their iSCSI initiator to Microsoft's [iSCSI Software] Initiator ... and I guess we found a bug," he says. Pelot says he's now "fairly comfortable" his systems will work as needed, but he admits that "what I don't know is what scares me"--that any update by any vendor's product might introduce a similar bug that could crash his DR systems. For that reason, his staff carefully evaluates which upgrades are critical and applies them to one system in a cluster at a time to ensure they work before installing them on the other system, he says.

Many storage shops only maintain a knowledgeable staff at their production site and not at a DR site, says Dan Lamorena, senior product marketing manager at Symantec. That makes it harder to ensure that no critical data is lost during replication, he says, and that the proper volumes and LUNs are configured in the right way at the DR site.

Click here for a sampling of tools that report on the storage infrastructure (PDF).

Monitoring and configuration tools
The difficulty of manually synchronizing a primary environment and a backup environment has led to the development of automated monitoring and configuration tools. The FlashSnap feature in Veritas Storage Foundation Enterprise enables admins to create application server groups that ensure application servers are configured to connect to the appropriate backup storage in the event of a hardware failure in the primary data center, says Sean Derrington, Symantec's director of storage management. This eliminates much of the effort and error associated with manually configuring such servers, he says.

Continuity Software Inc.'s RecoverGuard continually checks to determine that all changes made to the primary site are reflected on the backup site; identifies dependencies among servers, storage devices, databases and apps; and automatically detects gaps such as the passive node on a cluster not being mapped to the correct disk volume, which could cause a problem during a DR event.
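A drastically simplified version of that kind of gap detection can be expressed as a diff between configuration inventories of the two sites. The flat-dictionary model and field names below are invented for illustration; products like RecoverGuard discover servers, arrays, LUN mappings and cluster nodes directly rather than reading a prepared inventory.

```python
# Illustrative sketch of configuration-gap detection between a primary
# and a DR site. The inventory format and field names are invented.

def find_gaps(primary: dict, dr: dict) -> list[str]:
    """Report items missing at the DR site or configured differently."""
    gaps = []
    for item, config in primary.items():
        if item not in dr:
            gaps.append(f"missing at DR site: {item}")
        elif dr[item] != config:
            gaps.append(f"mismatch on {item}: "
                        f"primary={config!r} dr={dr[item]!r}")
    return gaps
```

Run continuously, a check like this surfaces drift on day two instead of during a failover, which is the point Laliberte makes above.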

RecoverGuard (which Continuity Software plans to offer as a service) currently supports EMC Corp. and NetApp arrays, with support for Hitachi Data Systems hardware expected in the next few months.

WysDM Software Inc.'s WysDM for Backups continually monitors backup environments and provides customized policy-based alerts about problems that could affect the DR environment. (EMC recently acquired WysDM Software.) Among other conditions, it can report on servers that haven't been backed up within a given period of time, and backups that need to be rescheduled to meet service-level agreements.
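A report like "servers not backed up within a given period" reduces to comparing each server's last-backup timestamp against its allowed window. The sketch below shows only that comparison; the server names and windows are made up, and a real product would pull timestamps from the backup catalog rather than a dictionary.

```python
# Minimal sketch of a backup-freshness report: flag servers whose last
# successful backup is older than their SLA window. Names are invented.
from datetime import datetime, timedelta

def stale_backups(last_backup: dict, window: dict,
                  now: datetime) -> list[str]:
    """Return servers whose last backup falls outside their SLA window."""
    return [
        server
        for server, ts in last_backup.items()
        if now - ts > window[server]
    ]
```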

BladeLogic's Data Center Automation suite can monitor the configuration of backup data centers to see if they vary from corporate policies. Vick Viren Vaishnavi, BladeLogic's director of product marketing, says some customers are using the suite to help them synchronize changes among sites.

Tracking such changes is much easier when a firm maintains the same hardware and software at both the primary and DR site, and uses virtualization to mask any disparities between the two. Dick Cosby, systems administrator at Estes Express Lines, used system utilities for his firm's IBM System Storage DS8000 storage to mirror changes in data, as well as changes to the sizes of the underlying volumes and LUNs, between the Richmond, VA-based headquarters and a DR facility in Mesa, AZ. But Cosby says he wouldn't be able to rely on this integrated process if, for example, he introduced EMC storage into the environment.

Test more often and more thoroughly
In testing disaster recovery (DR) plans, IT staffs often overlook how applications depend on each other for data and don't understand which related sets of data need to be backed up and restored to the same point in time, says Bill Peldzus, director of data centers, business continuity and disaster recovery at GlassHouse Technologies Inc., Framingham, MA.

By failing to fully understand application dependencies and properly identify "consistency groups"--related sets of data which should be updated to the same point in time--organizations risk running DR tests that don't accurately reflect how well an app could be recovered after an actual disaster, he says.

For example, an order-entry system might have a recovery point objective (RPO) and a recovery time objective (RTO) of four hours. A customer database with which it shares data might have an RPO and RTO of 12 hours. When the data for both applications is restored in a test, "you have customers without orders and orders without customers because the data wasn't recovered in a consistent fashion," says Peldzus.
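Peldzus' orphaned-records scenario is easy to reproduce in miniature: restore two related data sets to different points in time and check referential integrity. The tables, snapshot times and record layouts below are invented purely to demonstrate the effect he describes.

```python
# Miniature illustration of inconsistent recovery points. Orders
# reference customers; restoring each table to a different point in
# time leaves orphans. All data here is invented.

def restore_to(snapshots: dict, point: int) -> list:
    """Return the newest snapshot taken at or before the recovery point.

    `snapshots` maps snapshot time (hours) to the table's rows then.
    """
    usable = [t for t in snapshots if t <= point]
    return snapshots[max(usable)]

def orphan_orders(customers: list, orders: list) -> list:
    """Orders whose customer is missing from the restored customer table."""
    known = {c["id"] for c in customers}
    return [o for o in orders if o["cust_id"] not in known]
```

Restoring orders to hour 16 but customers only to hour 8 yields exactly the "orders without customers" Peldzus warns about; restoring both tables to the same point, as a consistency group would, does not.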

While many replication apps can identify consistency groups, they're much more difficult to use across different platforms, such as PC-based, minicomputer and mainframe servers, he says. Often, says Peldzus, test designers lack the time to properly configure such tools to account for all of the platforms a company operates.

Peldzus also recommends that companies test to see if they can recover all of their critical apps within the required time period, rather than (as is often the case) only testing the apps needed by one business unit or that serve one function, such as email.

Finally, he suggests testing critical apps most often. Apps with RPOs or RTOs of less than 24 hours are candidates for testing two to four times per year, rather than just annually, he says.

Cultural changes
Along with the right technology, storage admins need to put the proper processes in place to make sure their DR environments work as needed in an actual disaster. IBM's Sing recommends admins develop detailed schemas that specify what data is most important and needs to be recovered quickly. That will help them determine if they're backing up the right data to meet required recovery time objectives and recovery point objectives.

All too often, says Sing, "storage administrators don't have as much insight as they need about the business value of the data for which they're responsible." That makes it harder for them to restore the most crucial data the fastest. One shortcoming that often emerges during testing is the realization that the storage admin has backed up only the data files associated with an app, and not the files needed to actually bring up the app at the DR site, says Greg Schulz, founder and senior analyst at StorageIO Group, Stillwater, MN.

Some cultural changes can also ensure tests get done. While EMC suggests customers test critical apps at least several times a year, many are still struggling to test once a year, says John Linse, EMC's director of business continuity services. A summer 2007 survey of more than 1,000 data center managers by Symantec showed that a lack of staff, fear of disrupting business as usual and a lack of money are the most common barriers to running a full DR test. One reason, says Linse, is that "DR is sort of a second fiddle to operations" in many companies.

Educating others about storage's role in keeping vital apps running can ensure that staff members keep storage admins informed about changes that could affect DR, says Gil Hecht, CEO at Continuity Software.

He gives the example of a DBA who needs more disk space for a critical production app on a weekend, when the storage admin isn't available to provision it. The DBA might take unused space on a test and development system to keep the production app running and plan to tell the storage admin later. But if the DBA forgets, "the storage guy hasn't got a clue what his disk is being used for," says Hecht, or that the test and development system isn't replicated to a backup site and has thus invalidated the firm's DR plan.

Automated tools catch such changes, but Hecht says storage admins must educate other staffers about the need to communicate big and small changes, as even a small configuration change will affect the company's ability to plan for and recover from disasters.
