In addition to periodically testing your disaster recovery (DR) site, you can use DR testing tools to constantly monitor the site's readiness to recover from a disaster.
The first place to look for confirmation that critical backups and data replications to the DR site took place is the backup and replication software you're using and the reports it generates. In addition, the same tools storage administrators use for day-to-day storage efficiency and management can help assess the health of their backup infrastructures, says John Sing, a senior consultant on business continuity strategy and planning at IBM Corp.
Next, you should test the recoverability of data and apps from the DR site. How often each test is run and how extensive it should be vary; many firms test one or several apps, or selected portions of the backup environment, to reduce DR testing costs and avoid the risk of disrupting production apps.
Patrick Honny, departmental information systems manager for the County of San Bernardino Auditor/Controller-Recorder, performs an overall DR test (based on the assumption that the entire primary site has failed) once a year. He also tests the storage portion of his DR plan once a month. "Basically, that's as simple as repointing production servers to the Isilon [Systems Inc.] SANs that are receiving replicated data and pulling up some files," he says.
IBM's Sing, who uses automated testing tools, says he "can no longer afford to take five, 10 or 15 highly paid individuals off their jobs and dedicate them to two days of testing." Automated test tools ensure tests are done consistently and can be repeated over time, which makes them more useful for auditing purposes.
For example, Compuware Corp.'s Hiperstation line of software "records network traffic heading to and from the mainframe," says Mark Schettenhelm, Hiperstation product manager. It can then be used to access a remote site "and it's as if you have a hundred people beating away at that system," he says.
A number of vendors aim to make DR testing safer by creating replicas of data that can be written to and kept current. According to NetApp, its FlexClones act as "a transparent writable layer" in front of the snapshot copy of the data in the DR facility, allowing the snapshot to be used for DR testing while remaining up to date in case of an actual disaster. Symantec Corp. has enhanced a similar feature called Fire Drill in release 5.0 of Veritas Storage Foundation for Windows.
CA offers a similar capability with XOsoft Assured Recovery, which suspends the replication of data between a master and replica copy, spooling changes made during a test and applying them to the replica only after the test is over and replication resumes. CA XOsoft Assured Recovery works in conjunction with the CA XOsoft Replication and CA XOsoft High Availability DR and business continuity software (formerly WANSync and WANSyncHA, respectively).
These writable replicas not only make it safer to do DR testing, but make the replicated data available for other uses, such as data mining, and test and development.
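The spool-and-apply behavior described for XOsoft Assured Recovery can be illustrated with a short Python sketch. It is not CA's implementation: the Replica class and its methods are hypothetical, and it assumes the spooled changes are the ones arriving from the master while the test is running.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica that spools incoming changes while a DR test runs."""
    data: dict = field(default_factory=dict)
    spool: list = field(default_factory=list)
    testing: bool = False

    def receive(self, key, value):
        # Changes from the master are queued during a test, applied otherwise.
        if self.testing:
            self.spool.append((key, value))
        else:
            self.data[key] = value

    def start_test(self):
        # Suspend replication so the test sees a stable copy.
        self.testing = True

    def end_test(self):
        # Replay spooled changes in arrival order, then resume replication.
        for key, value in self.spool:
            self.data[key] = value
        self.spool.clear()
        self.testing = False


replica = Replica()
replica.receive("orders", 1)          # normal replication
replica.start_test()
replica.receive("orders", 2)          # arrives mid-test, so it is spooled
during_test = replica.data["orders"]  # replica still holds the pre-test value
replica.end_test()
after_test = replica.data["orders"]   # spooled change has now been applied
```

The point of the design is that the replica falls behind only for the length of the test and catches up as soon as the test ends.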
Staying in sync
"With pressure on businesses to respond more quickly to customer demands, the IT infrastructure supporting the business changes on a daily basis," says Bob Laliberte, an analyst at Enterprise Strategy Group (ESG), Milford, MA. "So any DR environment that was implemented and tested on day one could be at risk on day two. Unfortunately, that risk isn't discovered until a copy of the backup is needed or a DR test is run."
Jeff Pelot, chief technology officer at the Denver Health Hospital and Medical Center, saw how a seemingly minor change in the backup environment can cripple a DR effort when one of his LeftHand Networks Inc. IP SANs failed to take over for another during a monthly DR test. "LeftHand was migrating from their iSCSI initiator to Microsoft's [iSCSI Software] Initiator ... and I guess we found a bug," he says. Pelot says he's now "fairly comfortable" his systems will work as needed, but he admits that "what I don't know is what scares me," namely that an update to any vendor's product might introduce a similar bug that could crash his DR systems. For that reason, his staff carefully evaluates which upgrades are critical and applies them to one system in a cluster at a time to ensure they work before installing them on the other system, he says.
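The one-system-at-a-time practice Pelot describes amounts to a rolling upgrade that stops at the first failure. A minimal Python sketch follows; the function name, node names and health check are all illustrative assumptions, not anything LeftHand or Denver Health uses.

```python
def rolling_upgrade(nodes, apply_upgrade, is_healthy):
    """Upgrade nodes one at a time; stop at the first node that fails its check.

    Returns (nodes upgraded successfully, failing node or None).
    """
    upgraded = []
    for node in nodes:
        apply_upgrade(node)
        if not is_healthy(node):
            # Leave the remaining nodes untouched so they can still take over.
            return upgraded, node
        upgraded.append(node)
    return upgraded, None


applied = []  # records which nodes actually received the upgrade
done, failed = rolling_upgrade(
    ["san-a", "san-b"],                       # hypothetical cluster members
    apply_upgrade=applied.append,
    is_healthy=lambda node: node != "san-a",  # simulate a bug on the first node
)
```

In this run the upgrade fails on the first node, so the second node never receives it and remains available on the old, known-good version.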
Many storage shops only maintain a knowledgeable staff at their production site and not at a DR site, says Dan Lamorena, senior product marketing manager at Symantec. That makes it harder to ensure that no critical data is lost during replication, he says, and that the proper volumes and LUNs are configured in the right way at the DR site.
Monitoring and configuration tools
Continuity Software Inc.'s RecoverGuard continually checks to determine that all changes made to the primary site are reflected on the backup site; identifies dependencies among servers, storage devices, databases and apps; and automatically detects gaps such as the passive node on a cluster not being mapped to the correct disk volume, which could cause a problem during a DR event.
RecoverGuard (which Continuity Software plans to offer as a service) currently supports EMC Corp. and NetApp arrays, with support for Hitachi Data Systems hardware expected in the next few months.
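To make the idea of gap detection concrete, the following Python sketch compares a primary-site volume mapping against the DR site and lists what is missing. It is only an illustration of the kind of check described above, not how RecoverGuard works internally; the function, host names and volume names are hypothetical.

```python
def find_dr_gaps(primary, dr):
    """Return (host, issue) pairs where the DR site does not mirror the primary."""
    gaps = []
    for host, volumes in sorted(primary.items()):
        if host not in dr:
            gaps.append((host, "missing at DR site"))
            continue
        # Volumes mapped at the primary site but not at the DR site.
        missing = sorted(set(volumes) - set(dr[host]))
        if missing:
            gaps.append((host, "unmapped volumes: " + ", ".join(missing)))
    return gaps


# Hypothetical inventories: host name -> volumes mapped to that host.
primary_site = {"db01": ["vol_data", "vol_logs"], "app01": ["vol_app"]}
dr_site = {"db01": ["vol_data"]}
gaps = find_dr_gaps(primary_site, dr_site)
```

Here the check would flag app01 as absent from the DR site and db01 as missing its log volume, the sort of mismatch that otherwise surfaces only during a test or a real failover.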
WysDM Software Inc.'s WysDM for Backups continually monitors backup environments and provides customized policy-based alerts about problems that could affect the DR environment. (EMC recently acquired WysDM Software.) Among other conditions, it can report on servers that haven't been backed up within a given period of time, and backups that need to be rescheduled to meet service-level agreements.
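A report on servers that haven't been backed up within a given period reduces to a simple age check against the last successful backup time. The Python sketch below shows the idea under assumed data; the server names, timestamps and 24-hour policy are invented for illustration and do not reflect WysDM's data model.

```python
from datetime import datetime, timedelta


def stale_servers(last_backup, now, max_age):
    """Return servers whose most recent backup is older than max_age."""
    return sorted(name for name, ts in last_backup.items() if now - ts > max_age)


# Hypothetical data: server name -> time of last successful backup.
now = datetime(2008, 1, 15, 12, 0)
last_backup = {
    "mail01": datetime(2008, 1, 15, 1, 0),  # backed up 11 hours ago
    "erp01": datetime(2008, 1, 12, 1, 0),   # backed up more than three days ago
}
overdue = stale_servers(last_backup, now, timedelta(hours=24))
```

With a 24-hour policy, only erp01 is reported as overdue.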
BladeLogic's Data Center Automation suite can monitor the configuration of backup data centers to see if they vary from corporate policies. Vick Viren Vaishnavi, BladeLogic's director of product marketing, says some customers are using the suite to help them synchronize changes among sites.
Tracking such changes is much easier when a firm maintains the same hardware and software at both the primary and DR sites, and uses virtualization to mask any disparities between the two. Dick Cosby, systems administrator at Estes Express Lines, used system utilities for his firm's IBM System Storage DS8000 storage to mirror changes in data, as well as changes to the sizes of the underlying volumes and LUNs, between the Richmond, VA-based headquarters and a DR facility in Mesa, AZ. But Cosby says he wouldn't be able to rely on this integrated process if, for example, he introduced EMC storage into the environment.
All too often, says Sing, "storage administrators don't have as much insight as they need about the business value of the data for which they're responsible." That makes it harder for them to restore the most crucial data the fastest. One shortcoming that often emerges during testing is the realization that the storage admin has backed up only the data files associated with an app, and not the files needed to actually bring up the app at the DR site, says Greg Schulz, founder and senior analyst at StorageIO Group, Stillwater, MN.
Some cultural changes can also ensure tests get done. While EMC suggests customers test critical apps at least several times a year, many are still struggling to test once a year, says John Linse, EMC's director of business continuity services. A summer 2007 survey of more than 1,000 data center managers by Symantec showed that a lack of staff, fear of disrupting business as usual and a lack of money are the most common barriers to running a full DR test. One reason, says Linse, is that "DR is sort of a second fiddle to operations" in many companies.
Educating others about storage's role in keeping vital apps running can ensure that staff members keep storage admins informed about changes that could affect DR, says Gil Hecht, CEO at Continuity Software.
He gives the example of a DBA who needs more disk space for a critical production app on a weekend, when the storage admin isn't available to provision it. The DBA might take unused space on a test and development system to keep the production app running and plan to tell the storage admin later. But if the DBA forgets, "the storage guy hasn't got a clue what his disk is being used for," says Hecht. Nor does the storage admin know that production data now sits on a test and development system that isn't replicated to a backup site, which invalidates the firm's DR plan.
Automated tools catch such changes, but Hecht says storage admins must educate other staffers about the need to communicate big and small changes, as even a small configuration change can affect the company's ability to plan for and recover from disasters.