Are you confident your disaster recovery (DR) plan will work if disaster strikes? When TheInfoPro Inc., a New York-based research network, posed that question to several hundred IT executives, the results weren't exactly reassuring. Only 55% of the managers surveyed were confident they could recover their open-systems data in an emergency. The rest were only somewhat confident or not confident that their DR system would work.
|Losses incurred in a disaster|
No matter how many checklists a company makes and distributes, how many disaster scenarios it considers or even how assiduously it backs up its data, managers can't be confident in a firm's ability to recover data if the systems haven't been tested thoroughly. "You have to test to see if your disaster recovery processes really work," says Michelle Zou, research analyst with the storage software team at IDC, Framingham, MA. "Not everybody does enough testing."
Testing is difficult. "It's a complicated process. You're talking about mission-critical applications that companies don't want to take down," Zou says. As a result, tests have to be scheduled far in advance and, to do it right, the testing will likely require the involvement of a large number of people. All of this drives up costs. "And what if the tests don't work?" Zou asks. The organization has to go back through the entire process to identify and fix the problems and then test again--which means more time, money and disruption.
Large mainframe IT shops can often offer a model for DR testing. MasterCard International Inc., for instance, has been honing its DR processes for 15 years and continues to refine them. The current testing plan calls for two major test exercises a year in April and October. Each exercise tests up to 40 of what MasterCard classifies as its Tier 1 systems to meet a corporate DR mandate of testing every Tier 1 system at least once a year.
|Disaster recovery testing costs|
Such mainframe-style DR testing is expensive, something only the largest companies, and those that require bulletproof DR, can afford. "Even small tests can cost $30,000 per test," reports Male. "Large tests can run $1 million a test." Direct costs include the time of the people involved, telecommunications costs, the cost of activating a hot site or another remote facility, and travel (see "Disaster recovery testing costs," this page).
Obviously, testing costs would seem trivial if you couldn't recover from a disaster in a timely way (see "Losses incurred in a disaster," this page). According to a recent Forrester Research study, companies with annual revenues of at least $1 million from an online business average losses of $8,000 per hour during a systems outage, which comes to $192,000 for each 24-hour period the site is down. An earlier study by Meta Group (now Gartner Inc.) found that unplanned downtime of critical systems could cost a large company as much as $1 million per hour due to lost revenue, reduced employee productivity and possible regulatory penalties. And the $1 million figure doesn't include the negative impact on the business' reputation. It's no wonder companies like MasterCard don't skimp on DR testing.
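The arithmetic behind these figures is simple enough to sketch. The short Python example below is an illustration only: the function names are hypothetical, and it assumes losses accrue at a constant hourly rate, which a real outage may not follow. It reproduces the $192,000-per-day figure and shows how many hours of downtime it would take to equal the cost of a single test, using the $30,000 and $1 million test costs quoted earlier.

```python
def outage_loss(hourly_loss: float, hours_down: float) -> float:
    """Estimated loss for an outage, assuming a constant hourly loss rate."""
    return hourly_loss * hours_down


def breakeven_hours(test_cost: float, hourly_loss: float) -> float:
    """Hours of downtime whose estimated loss equals the cost of one DR test."""
    return test_cost / hourly_loss


if __name__ == "__main__":
    # Forrester figure: $8,000 per hour over a 24-hour outage.
    print(outage_loss(8_000, 24))                 # 192000
    # A small $30,000 test versus $8,000 per hour of downtime.
    print(breakeven_hours(30_000, 8_000))         # 3.75
    # A large $1 million test versus $1 million per hour of downtime.
    print(breakeven_hours(1_000_000, 1_000_000))  # 1.0
```

Under these assumptions, a small test pays for itself if it prevents less than four hours of downtime at the online-business rate, and a large test costs about as much as one hour of downtime at the large-company rate.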
Even organizations like Harvard University feel a compelling need for DR, although cost is a significant issue governing how well each of the university's various apps is protected. "We do primary infrastructure recovery for those applications willing to pay for it," says Ron Hawkins, senior technical architect at the Cambridge, MA-based institution. Not surprisingly, the only business units willing to sign up for Harvard's IT infrastructure recovery are those responsible for critical apps such as payroll, financial and data warehousing. "Everybody wants DR until they see the price," he notes.
Faced with widespread demand for less-costly DR, Hawkins has explored a variety of options, including VMware, snapshots and remote replication, in an effort to provide a less-expensive recovery service that would still be effective. He's also tried working with business units to juggle their recovery point objectives (RPOs) and recovery time objectives (RTOs) to come up with something less costly to implement, test and maintain (see "Disaster recovery testing tips").
The only alternative Harvard has come up with to sending a team to its hot site for several days is a collocation facility located off campus. "They'll sell us rack space and we can replicate some of our critical infrastructure systems there," says Hawkins. These systems include the e-mail hub and DNS service, and perhaps a half-dozen small, but critical, utility services that have to remain accessible to apps outside of Harvard even if all systems go down in a disaster. By replicating them to the collocation facility, they can be recovered nearly instantaneously.
Beyond cost is the business disruption caused by DR testing. Organizations are constantly playing with their DR budgets to decide how much money to allocate for DR and how much money to devote to testing their DR infrastructure. It's a catch-22 situation: "[Many] companies can't afford to take down their systems to test failover," says Stephanie Balaouras, senior analyst in the enterprise computing and networking decision service at Boston-based Yankee Group.
For companies looking for shortcuts that will reduce the pain of DR testing, XOsoft introduced Assured Recovery this past spring. The software promises to perform a full test of data and application recoverability on either a regular or ad-hoc basis without disrupting the production system. Assured Recovery basically spools data changes to a separate file while it validates whether the backed-up system will work in a recovery situation.
"The nice thing about XOsoft is that you don't have to take down your application, but you can still validate that the data you're relying on for DR is consistent, recoverable and restartable," says Balaouras. Although it will test the recoverability of the data, that alone, unfortunately, doesn't constitute full DR testing. "It doesn't replace testing people and procedures and all the interdependencies," she notes.
Denny's Inc., the Spartanburg, SC-based restaurant chain, is typical of companies that test their open-systems DR plans once a year, suggests TheInfoPro's Male. To keep costs reasonable, the company focuses on testing the recovery of the IT infrastructure only. "This is not the same thing as testing what the business would need for full business continuity," says Kurt Hazel, senior systems administrator at Denny's.
The company basically tests the backup process. "We make sure the facility has the correct systems to bring back the application and the data. We're not moving people around to test business recovery," Hazel says. Denny's sends a system administrator and a couple of business users to its Atlanta hot site for three days. The first two days are usually spent checking out system configurations and logs. The actual application testing--payroll, financials and ERP system--takes only one day.
|Disaster recovery testing tips|
The following tips were compiled by Mike Karp, senior storage analyst at Enterprise Management Associates Inc., Boulder, CO.
Business function testing
Just getting a backup tape to load doesn't qualify as DR testing at a major telecommunications company. "For us, doing a test means we can do the complete business function," the firm's storage manager says. That typically entails testing dozens, if not hundreds, of applications that are sometimes interdependent with other applications and dependent on various types of systems.
Given the difficulty of actual business function testing, the telecommunications company has adopted a three-tier test strategy. The first tier is to prove that the disaster team can recover a local copy of the database within the agreed-upon recovery time. "We don't consider it DR, but it is something. You'd be surprised how many operational challenges we encounter just doing this much," the manager reports. The second tier is the same as the first, except the team is now trying to recover the database to a remote location. "When the DBA says 'Yes, the data is here,' then we know we can get the data over and loaded in one piece," the manager explains.
The third tier is a three-day recovery exercise of the vertical business function where a disaster team visits the hot site, loads tapes and runs the application. "We're talking about hundreds of gigabytes of data, sometimes even a terabyte, so we can spend a whole day just waiting for the tapes to load," he explains. It's not unusual for a glitch in the tape loading to force a complete reload. "By that point, we end up with maybe a few hours on the last day to run the application and the data," he says.
But even all this third-tier testing doesn't amount to a true business function test. The team still hasn't put a user load on the application or tried to enter new transactions. "To do that, all the data sources would have to be remapped, and we haven't the time and resources to do that," the storage manager says. The third tier has proven so challenging that the telecommunications company is delaying the requirement for full tier-three testing until next year. The cost will run into the tens of millions of dollars, but that's still less than the value of the loss of just a few days' billing.
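One way to keep track of a tiered strategy like this is to record, for each tier, how long the recovery took against the agreed-upon recovery time and whether the data was confirmed to be intact. The Python sketch below is a hypothetical illustration, not the telecommunications company's actual tooling: the `TierResult` class, its fields and the pass rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class TierResult:
    tier: int              # 1 = local recovery, 2 = remote recovery, 3 = hot-site exercise
    elapsed: timedelta     # measured time to recover
    target: timedelta      # agreed-upon recovery time (assumed to exist per tier)
    data_verified: bool    # e.g., the DBA confirmed the data arrived in one piece

    @property
    def passed(self) -> bool:
        """A tier passes only if the data is verified and the target time is met."""
        return self.data_verified and self.elapsed <= self.target


def highest_tier_passed(results: List[TierResult]) -> int:
    """Return the highest tier passed with every lower tier also passed (0 if none)."""
    highest = 0
    for result in sorted(results, key=lambda r: r.tier):
        if not result.passed:
            break
        highest = result.tier
    return highest
```

With made-up numbers, a team that recovers the local copy in three hours against a four-hour target but needs nine hours against an eight-hour target for the remote copy would be reported as having cleared tier one only, regardless of how the tier-three exercise went.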
Without testing, managers can't assume with any confidence that their recovery strategies will work. Still, what managers mainly learn in the tests is how many IT infrastructure points of failure they have. Few have even begun to test the human elements of disaster recovery.