Disaster recovery: Test, test and test some more

Storage managers in New Orleans thought their disaster recovery (DR) plans were solid. Hurricane Katrina showed them otherwise. These dramatic stories are testimony that a DR plan is worthless unless it's been tested, updated and then tested again.

When New Orleans law firm Chaffe McCall L.L.P. was hit by Hurricane Katrina in 2005, network administrator James Zeller was in for an unpleasant awakening. Having learned to live with hurricanes, Zeller thought he was prepared for any eventuality. His disaster recovery (DR) plan was simple and, he thought, effective. With a smaller office in Baton Rouge, LA, and backup tapes stored in an offsite location, the law office's DR plan called for restoring data and applications in Baton Rouge and conducting business with users connected remotely.

Zeller had never tested Chaffe McCall's DR plan, and the firm's post-Katrina recovery was stymied by one surprise after another. Underpowered servers running Microsoft Corp.'s Windows NT Server 4.0 couldn't run apps that required Windows 2000 Server or Windows Server 2003. The recovery site lacked suitable tape drives, and the infrastructure in Baton Rouge couldn't cope with the increased computing and network load. Together, these problems brought the business to a grinding halt. Unable to procure servers due to high post-hurricane demand for hardware, Zeller rushed to purchase and configure high-end desktops to run his apps. After a strenuous week, business returned to something approaching normal with one cruel lesson learned: An untested DR plan isn't worth the paper it's printed on.

Chaffe McCall is hardly alone in its experience. DR planning and testing are costly, and the contingency nature of DR makes them susceptible to budget cuts and testing shortcuts.

The cost of a single DR test can easily exceed $100,000 in larger organizations. To control costs, some companies perform several smaller component tests during the year in addition to a full annual test. Others extend the testing cycle across several years. "As part of our business-continuity plan, we are obligated to test each business application at least once every three years," reports Dan Traynor, director of IT infrastructure services at Southern Company, an energy firm in Atlanta.

DR testing fundamentals
  • Disaster recovery (DR) testing isn't about pass and fail. It's about exercising and rehearsing the DR plan to reveal shortcomings and weaknesses.
  • DR plans aren't static; they need to be regularly reviewed and adjusted to reflect changes in the enterprise and business.
  • Every recovery step needs to be painstakingly documented and tested. Seemingly minuscule details can bring the recovery effort to a standstill.
  • People are key, and everyone needs to know their role.
  • Be sure to have a designated DR coordinator and an established chain of command. The ability to make fast decisions during a disaster is crucial for a rapid and successful recovery.
  • Always have a plan A, B and C.

What to test?
Only a limited number of services and applications can be realistically included in a DR test plan. Business criticality of services, exposure to loss, risk tolerance, and an assessment of threats and vulnerabilities determine what needs to be included, in what order of priority, and within what timeframe services need to be restored. Recovery time objectives (RTOs) define how quickly a service must be restored, and recovery point objectives (RPOs) define how much data loss is tolerable. DR planning and testing is a continual balancing act among costs, budget and the potential business loss if a disaster occurs.
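
To make the role of RTOs and RPOs in a test concrete, the following Python sketch compares what a rehearsal measured against the stated objectives. It is a minimal illustration, not a prescribed method: the class names, the service names and all the time values are hypothetical. In keeping with the idea that a DR test is a rehearsal, it reports gaps to work on instead of a pass or fail verdict.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjective:
    service: str
    rto: timedelta  # maximum tolerable time to restore the service
    rpo: timedelta  # maximum tolerable window of lost data


@dataclass
class RehearsalResult:
    service: str
    recovery_time: timedelta     # time the restore took during the test
    data_loss_window: timedelta  # age of the newest data that was recovered


def find_gaps(objectives, results):
    """Return (service, issue) pairs, most urgent RTO first."""
    measured = {result.service: result for result in results}
    gaps = []
    for objective in sorted(objectives, key=lambda o: o.rto):
        result = measured.get(objective.service)
        if result is None:
            gaps.append((objective.service, "not exercised"))
            continue
        if result.recovery_time > objective.rto:
            gaps.append((objective.service, "rto"))
        if result.data_loss_window > objective.rpo:
            gaps.append((objective.service, "rpo"))
    return gaps


# Hypothetical objectives and rehearsal measurements.
OBJECTIVES = [
    RecoveryObjective("email", timedelta(hours=24), timedelta(hours=4)),
    RecoveryObjective("erp", timedelta(hours=4), timedelta(minutes=15)),
    RecoveryObjective("public_website", timedelta(hours=72), timedelta(hours=24)),
]
RESULTS = [
    RehearsalResult("erp", timedelta(hours=6), timedelta(minutes=10)),
    RehearsalResult("email", timedelta(hours=20), timedelta(hours=24)),
]

if __name__ == "__main__":
    for service, issue in find_gaps(OBJECTIVES, RESULTS):
        print(f"{service}: {issue}")
```

A service that was left out of the rehearsal shows up as "not exercised", which keeps untested services visible in the report.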

While prioritization identifies mission-critical applications and services, seemingly low-priority services that high-priority apps depend on are sometimes excluded from the DR test. With IT services and infrastructure tightly intertwined, the risk of neglecting low-priority dependent services is high. This is exactly what happened to Jim Burgard, assistant vice chancellor for university computing and communication at the University of New Orleans (UNO). The university's inability to get to its backup tapes after Hurricane Katrina prevented Burgard from recovering Active Directory, requiring a total Active Directory rebuild from scratch.
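
One way to keep supporting services from being left out is to derive the test scope from a dependency map instead of from the priority list alone. The Python sketch below walks such a map and returns every service a critical application relies on, directly or indirectly. The map and the service names are hypothetical examples, not a description of any environment mentioned in this article.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it needs to run.
DEPENDS_ON = {
    "erp": ["database", "active_directory"],
    "email": ["active_directory", "dns"],
    "database": ["san_storage"],
    "active_directory": ["dns"],
    "dns": [],
    "san_storage": [],
    "public_website": ["dns"],
}


def dr_test_scope(critical_services, depends_on):
    """Return the critical services plus all direct and indirect dependencies."""
    scope = set()
    queue = deque(critical_services)
    while queue:
        service = queue.popleft()
        if service in scope:
            continue  # already visited; also guards against circular dependencies
        scope.add(service)
        queue.extend(depends_on.get(service, []))
    return scope


if __name__ == "__main__":
    print(sorted(dr_test_scope(["erp"], DEPENDS_ON)))
```

In this example, testing only "erp" pulls in the directory service, DNS, the database and the storage it sits on, even though none of them may appear on the list of mission-critical applications.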

With the interdependency of IT services, it's essential to consider and exercise all aspects that impact the recovery of mission-critical services, including the following areas:

  • The network
  • Data/storage
  • Applications
  • The data center
  • Communication systems

There isn't a single "right" way to perform DR testing because it depends on the specific situation, defined priorities and the DR plan at hand. The level of redundancy in place has a huge impact on the DR exercise. For instance, the effort to rehearse failing over to a continuously updated redundant storage array in a secondary data center is relatively simple; in contrast, having no secondary array to fail over to requires restoring terabytes of data and rehearsing the loss of the data center itself. The DR testing efforts and costs associated with the two scenarios differ greatly, and companies need to do a thorough analysis before deciding whether to invest in redundancy or to pour money into a more elaborate rehearsal. The real payoff of redundancy comes into play when an actual disaster strikes. Loss of business productivity, caused by long recovery times, can easily exceed the cost of redundancy.

The ability to recover depends on accurate documentation, and a DR test needs to execute the documented recovery steps meticulously. DR documentation needs to be updated continuously and reviewed periodically. "Whenever we procure or develop a new application or system, we review the DR requirements and update our DR plan and DR test procedures accordingly," reports Paul Stonchus, first vice president and data center manager at MidAmerica Bank in Clarendon Hills, IL.

The single greatest risk in keeping up with changes is the lack of a solid change management and verification process to ensure that changes are performed according to procedure. Change management tools like Finisar Corp.'s NetWisdom, Onaro Inc.'s SANscreen, Tek-Tools Inc.'s StorageProfiler, as well as those tools built into storage, network and system management suites, track changes in your environment. Besides third-party tools and free tools such as Syslog, monitoring and element managers like Cisco Systems Inc.'s Cisco Device Manager also track changes.

The network
The DR plan and its testing need to cover LANs, WANs connecting sites, Internet connectivity and remote user access, which mostly takes the form of client virtual private networks (VPNs). Unless prohibited by budgetary restrictions, all mission-critical network connections should be designed redundantly. For LANs and SANs, especially in the data center, this means having servers and storage arrays redundantly attached to dual core network switches and Fibre Channel (FC) switches. Automatic failover is then handled by dynamic routing protocols like Open Shortest Path First (OSPF) for IP networks, multipath I/O for FC-attached storage and port trunking for inter-switch links (ISLs). WAN and Internet connection redundancy requires either two carriers or site-to-site VPNs that back up private-line WAN circuits, with dynamic routing protocols like Border Gateway Protocol (BGP-4) or OSPF performing the automatic failover.

Testing redundant network connection failover is relatively straightforward and can be as simple as forcing a failure of the primary link by manually disabling a port on a switch or router. If it's configured correctly, dynamic routing should redirect traffic through the redundant circuit without any disruption. Due to the complex nature of routing, it's highly recommended to verify failover beyond simply accessing resources on the remote site. Tools like traceroute, network mappings, and topology graphing tools in storage and network management apps can be used to verify proper failover.
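
As an illustration of verifying the path and not just reachability, the Python sketch below parses numeric traceroute output and checks that traffic passes through the backup gateway and no longer through the primary one. It assumes a Unix-style `traceroute -n` whose hop lines start with the hop number. The gateway and destination addresses are hypothetical and taken from the ranges reserved for documentation.

```python
import re
import subprocess

HOP_LINE = re.compile(r"^\s*\d+\s")               # hop lines start with a hop number
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")  # dotted-quad addresses


def hops_from_traceroute(output):
    """Extract the hop addresses, in order, from numeric traceroute output."""
    hops = []
    for line in output.splitlines():
        if HOP_LINE.match(line):
            hops.extend(IPV4.findall(line))
    return hops


def uses_backup_path(output, primary_gateway, backup_gateway):
    """True if the route crosses the backup gateway and avoids the primary one."""
    hops = hops_from_traceroute(output)
    return backup_gateway in hops and primary_gateway not in hops


def run_traceroute(destination):
    """Run the system traceroute with numeric output (assumes it is installed)."""
    completed = subprocess.run(
        ["traceroute", "-n", destination],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


# Hypothetical output captured after the primary link was disabled.
SAMPLE_OUTPUT = """traceroute to 203.0.113.10 (203.0.113.10), 30 hops max, 60 byte packets
 1  192.0.2.1  0.412 ms  0.398 ms  0.385 ms
 2  198.51.100.254  8.120 ms  8.101 ms  8.090 ms
 3  203.0.113.10  9.530 ms  9.512 ms  9.498 ms
"""

if __name__ == "__main__":
    print(uses_backup_path(SAMPLE_OUTPUT, "198.51.100.1", "198.51.100.254"))
```

During an actual rehearsal, `run_traceroute` would be called once before and once after the primary port is disabled, and the two hop lists compared.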

Redundancy can also breed complacency and create a false sense of safety. A DR rehearsal of the network needs to contemplate different scenarios so you don't fall into common traps. An important aspect of redundant network connections is the dependency between primary and failover connections. Commonalities need to be clearly identified, as they present a single point of failure. For instance, the resilience of two Internet connections from two different carriers using BGP-4 routing for automatic failover is compromised if the wiring for both connections enters the building through the same minimum point of entry. Similarly, having the primary and failover connection from the same carrier is problematic in case the carrier experiences difficulties.
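
A simple inventory comparison can help surface such commonalities before a rehearsal. The Python sketch below lists every attribute on which the primary and failover links rely on the same component. The attribute names and values are hypothetical; a real inventory would come from the network documentation.

```python
def shared_components(primary, failover):
    """Return the attributes on which both links depend on the same component."""
    return {
        attribute: primary[attribute]
        for attribute in primary.keys() & failover.keys()
        if primary[attribute] == failover[attribute]
    }


# Hypothetical inventory of two Internet connections.
PRIMARY_LINK = {
    "carrier": "Carrier A",
    "building_entry": "north conduit",
    "edge_router": "rtr-1",
    "power_feed": "UPS-A",
}
FAILOVER_LINK = {
    "carrier": "Carrier B",
    "building_entry": "north conduit",
    "edge_router": "rtr-2",
    "power_feed": "UPS-A",
}

if __name__ == "__main__":
    for attribute, value in sorted(shared_components(PRIMARY_LINK, FAILOVER_LINK).items()):
        print(f"single point of failure: {attribute} = {value}")
```

Here the two links use different carriers and routers but share a building entry and a power feed, so each of those is a candidate single point of failure.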

DR testing of network connections without redundancy is more difficult, and comprises a combination of component testing and process, documentation and service-level agreement (SLA) verifications. It begins with verifying the process for replacing defunct switches and routers, including testing the configuration and proper operation of spare equipment, as well as the procedure for getting replacement hardware. Vendor agreements, contact information and SLAs need to be verified and at least partially tested.

Finally, the network DR rehearsal needs to account for changes in network load in case of a disaster. Typically, a failover circuit, such as a VPN or frame-relay circuit, is a lower cost alternative to the primary connection. The DR test must ensure that the network connections can deal with the increased network load during a disaster. Some DR plans count on users connecting from home via VPN client software to conduct business. Under normal circumstances, only a relatively small fraction of the employee community connects via VPN; that number will grow substantially during a disaster. The DR test must ensure the VPN server can deal with the increased load and that there's plenty of Internet bandwidth to cope with the bandwidth surge caused by the increase in VPN usage.
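
A rough capacity estimate can be worked out before the rehearsal and then checked against what the test measures. The Python sketch below is a back-of-the-envelope calculation only. Every number in it is a hypothetical assumption (headcount, share of remote users, per-user bandwidth, VPN session limit and Internet link size), and it ignores other traffic that shares the Internet connection.

```python
def vpn_surge_check(employees, remote_percent, kbps_per_user,
                    vpn_session_limit, internet_mbps):
    """Estimate concurrent VPN sessions and bandwidth for a given remote share."""
    sessions = (employees * remote_percent + 99) // 100  # integer round-up
    required_mbps = sessions * kbps_per_user / 1000
    return {
        "sessions": sessions,
        "required_mbps": required_mbps,
        "sessions_ok": sessions <= vpn_session_limit,
        "bandwidth_ok": required_mbps <= internet_mbps,
    }


if __name__ == "__main__":
    # Hypothetical figures: 500 employees, 256 kbps per user,
    # a 250-session VPN limit and a 100 Mbps Internet link.
    for label, percent in (("normal day", 10), ("disaster", 80)):
        print(label, vpn_surge_check(500, percent, 256, 250, 100))
```

With these assumed figures, a normal day needs 50 sessions and about 12.8 Mbps, while a disaster needs 400 sessions and about 102.4 Mbps, which exceeds both the session limit and the link.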

Understanding BC and DR terms
Business continuity (BC) and disaster recovery (DR) terms can be confusing. Here are the definitions of some terms you're likely to encounter:

BC is an overarching term that includes DR, business recovery, business resumption, contingency planning and crisis management.

Business recovery deals with the recovery of human resources and how to conduct business during, and right after, a disaster.

Business resumption deals with resuming business functions between the point when an event occurs and the point at which a disaster is declared. For instance, if the e-commerce Web site becomes unavailable, order taking is delegated to designated staff as part of the business resumption plan.

DR deals with the recovery of technology, including facility, infrastructure and IT services.

Contingency planning deals with problems and disasters related to third-party service providers like application service providers and clearinghouses.

Crisis/Incident management is the command center that manages events during a disaster. It includes communication to employees, customers and the public, and ensures that all relevant parties know what to expect.

Data storage
A solid recovery-from-tape DR strategy requires several tape sets to be stored offsite. Meticulously testing the process of retrieving tapes from the offsite location is imperative. The DR exercise needs to include a verification of the offsite contact information, the list of users who are permitted to request tapes, and an assessment of how many and which tape sets are kept offsite.

The DR test needs to challenge and verify policies to avoid unpleasant surprises like the one experienced by Bill Bremerman, global services manager at Cookson Electronics, Providence, RI, who lost two of three tape sets that were in transit and never made it to the offsite location.

You also need to assess to what extent the offsite location may be impacted by a disaster. UNO's Burgard couldn't get to his offsite tape sets for two weeks because they were stored in New Orleans during Hurricane Katrina. "We changed our offsite tape location to Iron Mountain [in Boston] and we are now storing tapes 80 miles from the data center," he says.

Data that requires a more aggressive RTO and RPO is protected through a data replication product or emerging continuous data protection (CDP) offerings. As CDP products become more common and affordable, companies are beginning to use them to protect all tiers of storage. It isn't uncommon for companies to implement a replication or CDP product after a DR test--or an actual disaster--to improve their ability to recover. After a painful recovery from tape after Hurricane Katrina, Chaffe McCall's Zeller implemented a host-based replication product from XOsoft (acquired by CA Inc.), and Burgard installed a host-based data protection solution from Neverfail Group Ltd.

Verification of the consistency and completeness of the replicated data is imperative during a DR test of data protected via replication or CDP. In addition to business-user verification, custom scripts or third-party file-verification tools should be run to verify that the secondary data is in sync with primary data. Although most CDP and replication solutions perform consistency checks, the DR test needs to confirm, independently of the replication or CDP product, that data is replicated completely and consistently.
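
A custom verification script of the kind mentioned above could look like the following Python sketch, which compares SHA-256 checksums of every file under a primary directory and its replica. It is a minimal example under stated assumptions: both trees are reachable as file system paths from the machine running the script, and both are quiesced or taken from the same point-in-time snapshot, since files that change during the comparison would be reported as differences. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path relative to root to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in root.rglob("*")
        if path.is_file()
    }


def compare_trees(primary_root, replica_root):
    """Report files that are missing, extra or different on the replica."""
    primary = tree_digests(primary_root)
    replica = tree_digests(replica_root)
    common = primary.keys() & replica.keys()
    return {
        "missing_on_replica": sorted(primary.keys() - replica.keys()),
        "extra_on_replica": sorted(replica.keys() - primary.keys()),
        "content_differs": sorted(
            name for name in common if primary[name] != replica[name]
        ),
    }
```

An empty report in all three categories indicates that the replica matched the primary at the time of the check. It does not replace business-user verification of application consistency.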

Applications
Most DR plans are constructed around recovering applications. The list typically starts with critical business applications like enterprise resource planning (ERP), customer relationship management (CRM) or manufacturing execution systems, followed by lower priority applications like the company's public Web site.

The DR test of redundantly run apps consists of three steps: Initiating the failover, business-user verification of transactions and application consistency, and switching back to the primary application instance. The easiest failover scenarios to test are symmetrically load-balanced applications like a Web site or grid clusters; the only impact of disabling nodes--as long as the failover works--is an increased load on the remaining nodes. A performance impact analysis should be part of the test.

Transactional applications like databases or Exchange can't be load balanced as easily and require a cluster-type application like Microsoft Cluster Service, CA XOsoft's WANSync or products from Neverfail Group. These products continuously replicate transactions and changes to the failover system, monitor availability of the primary application and perform an automatic failover in case the primary instance fails. DR testing of clustered applications is usually disruptive, as the primary instance can't be used while the test is performed. Avoiding that disruption was the reason Chaffe McCall's Zeller chose the CA XOsoft WANSync product family.

"[CA] XOsoft's Assured Recovery enables us to conduct comprehensive and regular DR tests to standby servers without any disruption to our production servers or any interruption to the disaster recovery protection," explains Zeller.

Recovering applications without a failover solution depends on restoring the app from backups. To avoid Zeller's experience during Hurricane Katrina, your DR test needs to ensure that the correct hardware, operating system and application software are available and that the recovery can be successfully performed on the designated recovery systems. The test needs to ensure that all required components, especially the application software, have a valid maintenance contract and valid software protection codes (SPCs) if required by an application. SPCs bind an application to specific hardware.

"During a disaster, you never want to hear that 'it' is out of contract," says Cookson Electronics' Bremerman. "Some of our applications were out of maintenance and we had to pay a premium to get them back on maintenance to get a valid SPC code."

Furthermore, vendor contact information needs to be up to date and you must ensure that the hours of support are in line with RTOs. "Getting hold of our vendors during the night and on the weekend was a huge challenge," laments Bremerman. "It took about 30 hours to get a good SPC code from JD Edwards."

The data center
Data center recovery testing prepares for the worst-case scenario: losing a complete data center. Companies have taken two approaches to resume operations after a data center loss. The first and least-expensive option is to resume data center operations in another branch office. If you take this route, the recovery test must ensure that the designated branch office is ready to perform double-duty, a lesson learned by Zeller. If harnessing another branch office isn't an option, companies can contract for collocation space to resume data center operations.

The second option is to have a designated DR site--from companies like Hewlett-Packard Co., IBM Corp. and SunGard Data Systems Inc.--equipped with the infrastructure and resources to resume operations. If you opt for a hot site with one of these DR vendors, agreements should specify the frequency and extent of DR tests. A contract with a defined scope and SLAs makes a difference. "As part of our contract with SunGard, we get two [sessions of] 72 hours of DR testing annually," says Southern Company's Traynor.

Communication systems
Obviously, the ability to communicate with employees and members of the recovery team is critical during a disaster. Assuming availability of the corporate e-mail server or PBX during a disaster is risky. Independent methods of communication can be implemented in various ways. For instance, UNO's Burgard relies on a hosted Web application to communicate with the university staff and students. "The communication Web site plays a central role during our DR tests," says Burgard. Zeller signed up with a low-cost, third-party e-mail provider, using Postini Inc.'s hosted service to reroute e-mail from the corporate e-mail servers to the third-party e-mail service if needed. Most business-continuity planning products include communication modules, such as Strohl Systems Group Inc.'s NotiFind, that are used to deliver critical messages and receive important data during a disaster.

