Published: 17 May 2006
Your company may have formulated a disaster recovery plan and invested in the technology to support it, but that might not be enough to ensure that you can recover data.
As the frequency of natural and man-made disasters has increased over the last few years, storage managers' disaster recovery (DR) plans are being scrutinized and undergoing much more testing and refinement.
That's the good news. The bad news is that most of those plans will fail. What follows are the top 10 reasons why most DR plans will fall short of protecting a company's data.
1. There's no DR plan.
If it weren't for CNN, most people would probably never think about disasters. Storage managers focus on day-to-day issues such as system performance and availability. Backups get more attention than DR in most companies because even a moderately sized company will experience the need for operational recoveries every once in a while.
Few people have been through a disaster that takes out their entire data center or campus. The only way to solve the problem of not having a DR plan is to dedicate one or more people to building one. In addition, these people should be devoted to ensuring the existence and success of the DR plan.
2. There's no business-continuity plan.
Many people confuse DR and business continuity (BC). The purpose of a DR plan is to ensure that a company's computer systems (and the data stored on those systems) can be made available after a disaster. But there's much more to a business than its computer systems. There are people, communication systems, product creation and delivery systems, as well as many other systems that aren't included as part of a DR plan.
It's possible to have a perfectly functioning data center and still have a non-functional company. The computers may be up and running, but the people who make or deliver your product may not have the physical resources to do their jobs. Or your company's customers may not have a way to contact your sales or customer service departments. In short, the DR plan might have been executed just fine, but your business can't continue. So, you must ensure that any DR plan is coupled with a BC plan. Again, one or more staff members should have sole responsibility for developing a BC plan that integrates with DR planning.
3. There's a mismatch between business needs and IT's priorities.
Storage departments tend to focus on systems that are hard to back up. For example, a backup and recovery specialist spends a significant amount of time ensuring that the data warehouse is successfully backed up. The specialist justifies the time spent on the system because it's the largest one in the environment and, therefore, the most problematic. Right next to that system is a customer database that consistently backs up without skipping a beat.
If someone from the company's business side were asked which system should get more attention, they would surely say the customer database. While recovering the data warehouse after a disaster is important, the business users are likely to say that it's the last system that needs to be recovered, so a two- or four-week recovery time objective (RTO) is perfectly acceptable. However, they'll expect the customer database to be up and available within minutes of a disaster.
Because of this mismatch between business and IT, the customer database is being backed up in a way that gives it a recovery point objective (RPO) of 24 hours and an RTO of eight hours—not the minutes its users expect. DR priorities, and the infrastructure and processes underlying them, must be aligned to meet business needs. And someone must be responsible for ensuring that those priorities are at the core of the firm's DR plan.
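One way to make this kind of mismatch visible is to record what the business asks for next to what the backup design actually delivers, and compare the two. The following is a minimal sketch, not a prescribed method: the `Objective` type, the `find_gaps` helper, the system names and every number except the customer database's 24-hour RPO and eight-hour RTO are assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Objective:
    """Recovery objectives for one system (hypothetical structure)."""
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable downtime


def find_gaps(required, delivered):
    """Return {system: [missed objectives]} where IT delivers less than the business needs."""
    gaps = {}
    for system, need in required.items():
        have = delivered.get(system)
        if have is None:
            # The business listed a system that has no protection recorded at all.
            gaps[system] = ["unprotected"]
            continue
        missed = []
        if have.rpo > need.rpo:
            missed.append("RPO")
        if have.rto > need.rto:
            missed.append("RTO")
        if missed:
            gaps[system] = missed
    return gaps


# What the business side says it needs (illustrative values).
required = {
    "customer_db": Objective(rpo=timedelta(minutes=5), rto=timedelta(minutes=5)),
    "data_warehouse": Objective(rpo=timedelta(hours=24), rto=timedelta(weeks=2)),
}

# What the current backup design delivers (data warehouse values are assumed).
delivered = {
    "customer_db": Objective(rpo=timedelta(hours=24), rto=timedelta(hours=8)),
    "data_warehouse": Objective(rpo=timedelta(hours=24), rto=timedelta(hours=12)),
}

for system, missed in find_gaps(required, delivered).items():
    print(f"{system}: misses {', '.join(missed)}")
```

With these sample figures, only the customer database is flagged, which is the situation described above: the system that is easy to back up is the one falling short of what its users expect.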
4. There's a mismatch between requirements and reality.
The person responsible for defining business requirements also has to understand that not all things are technically possible. Most business people would ask for an RTO and RPO of zero seconds or, if they were in a good mood, a few minutes. They want all data accessible immediately after a disaster, and they can't afford to lose any data along the way. While this is actually possible with some technologies, the chief financial officer might not agree that the cost of these technologies makes it viable to use them on every computer in the data center.
Therefore, while it's important to ensure that business needs are met, it's also important to ensure that the requirements are realistic. Gaining consensus on this issue requires negotiations between business users and IT. The conversation typically goes something like this:
User: "We need an RTO and RPO of one minute."
IT person, while placing his pinkie next to his mouth like Dr. Evil in an Austin Powers movie: "That's possible. But it'll cost $1 million."
User: "That's too much money. What about a four-hour RTO and RPO?"
IT: "We can do that for $10,000."
User: "It's a deal."
If conversations like this aren't occurring on a regular basis in your company, then your DR plan is almost certainly mismatched with the requirements of the business units.
5. The DR plan specifies only system RTOs, not data center RTOs.
Most companies that have negotiated RTOs have negotiated these numbers for individual systems. For example, a common RTO is that any system in the data center must be recovered within four hours. While this works fine for operational recoveries and system-level outages, it doesn't work when the entire data center is lost. It's usually assumed that a system that needs to be recovered is given access to all system resources. For example, a large database server that needs to be recovered is given access to all 20 tape drives in the tape library. But what happens when 20 or 100 servers need to be recovered? They can't all be given access to all 20 tape drives in the tape library.
This is quite possibly the most difficult conversation that needs to occur between IT and those business units that need a DR plan. It brings to light one of the core problems with traditional backup and recovery: In a true disaster, it's fairly certain that the storage department isn't going to meet its RTOs or RPOs. Unless your company is able to live without its data for several days, the only way to have an entire data center's data available after a disaster is if it was "recovered" before the disaster happened. Traditionally, this has been accomplished with replication. Depending on the amount of data, it can also be accomplished with other technologies. But realistically, the working RPOs and RTOs of most companies don't allow the recovery of an entire data center to begin after a disaster.
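A rough back-of-the-envelope calculation shows why per-system RTOs stop holding once every system competes for the same tape drives. The sketch below assumes restore throughput scales linearly with the number of drives, which is optimistic; the drive throughput, server size and server count are hypothetical figures, and only the 20 drives and the four-hour RTO come from the example above.

```python
def restore_hours(data_gb, drives, gb_per_drive_hour):
    """Hours needed to restore data_gb when `drives` tape drives stream in parallel.

    Assumes throughput scales linearly with drive count (an optimistic simplification).
    """
    if drives < 1 or gb_per_drive_hour <= 0:
        raise ValueError("need at least one drive and a positive throughput")
    return data_gb / (drives * gb_per_drive_hour)


# Hypothetical figures, chosen only for illustration.
DRIVES = 20               # tape drives in the library
GB_PER_DRIVE_HOUR = 100   # assumed sustained restore rate per drive
SERVER_GB = 2000          # assumed data per server
RTO_HOURS = 4             # the per-system RTO from the example

# One server with the whole library to itself.
one_server = restore_hours(SERVER_GB, DRIVES, GB_PER_DRIVE_HOUR)

# 100 servers of the same size sharing the same 20 drives.
whole_site = restore_hours(100 * SERVER_GB, DRIVES, GB_PER_DRIVE_HOUR)

print(f"one server: {one_server:.0f} h, within RTO: {one_server <= RTO_HOURS}")
print(f"100 servers: {whole_site:.0f} h, within RTO: {whole_site <= RTO_HOURS}")
```

Under these assumptions a single server comes back in about an hour, while the full data center needs roughly 100 hours, or more than four days. The exact numbers will differ in any real environment, but the shape of the result is the point: the per-system RTO says nothing about the data center RTO.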
6. The DR plan doesn't contain consistency groups.
Multiple computer systems often serve multiple business applications, and the systems need to be restored to the same point in time. But typically, this is another area where there's a mismatch between the needs of the business and IT's capabilities. A consistency group is a collection of volumes across multiple storage units that are managed together when creating consistent copies of data. Unless consistency groups have been created, it's highly unlikely that there's technology in place to recover a group of systems to a single point in time. This is another dirty secret about the way most backup systems work. It's not possible to create a consistency group without advanced technology such as snapshots, replication or a continuous data protection system.
7. The backups don't work.
Sometimes the root of a DR problem is much simpler. There are some companies that have DR plans with realistic RTOs and RPOs, but if their backup system didn't work just prior to the disaster, the DR plan won't work. Many backup systems are either partially or completely broken. Even a good backup success rate of 95% means that 5% of the data center isn't backed up on any given night. And very few companies know what data constitutes that 5%. It could be a random number of systems on a particular night, which might not be so bad. But it could be the same systems failing every night. It's amazing how many companies don't track consecutive failures in their backup system. Even more inexcusable is the number of backup software packages that don't track this vital information.
A core problem that causes many backup system failures is how the backups are managed. The person responsible for this job is rated on the success of the backup system. While this might sound like a good idea, it usually puts undue pressure on a junior staffer. The result is that many backup people try to hide the number of backup failures from management. They think if they tell anyone how bad the backups are, they'll be fired or lose their bonus. They're sure that the backups will be better tomorrow, next week or next month, and that they'll be able to fix the backups before anybody notices how bad the situation really is.
The solution to both of these problems is commercial data-protection management software. These products provide information about backups that's not easily accessible, starting with a report that lists consecutive failures. Some of the products also help you to better understand the reasons behind the failures, so you can fix the core problem instead of continually repairing the symptom. Finally, these products remove the backup system administrator's ability to hide failures. There are no more secrets about the backup process when even the business owners can open a Web browser and see how the company's backups are performing.
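The consecutive-failure report mentioned above is simple to reason about even without a commercial product. The following sketch is only an illustration of the idea, not how any particular product works: it assumes backup results are available as a per-system list of nightly outcomes, oldest first, and the function names, system names and sample history are all hypothetical.

```python
from typing import Dict, List, Tuple


def consecutive_failures(history: Dict[str, List[bool]]) -> Dict[str, int]:
    """Count, per system, the failed backups in a row ending at the most recent night.

    history maps a system name to nightly results, oldest first (True = success).
    """
    streaks = {}
    for system, results in history.items():
        streak = 0
        for ok in reversed(results):
            if ok:
                break  # the streak ends at the most recent success
            streak += 1
        streaks[system] = streak
    return streaks


def failure_report(history: Dict[str, List[bool]], threshold: int = 2) -> List[Tuple[str, int]]:
    """List systems whose current failure streak is at least `threshold`, worst first."""
    flagged = [(s, n) for s, n in consecutive_failures(history).items() if n >= threshold]
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


# Five nights of made-up results for four made-up systems.
history = {
    "customer_db": [True, True, True, True, True],
    "file_server": [True, False, True, False, True],
    "hr_app": [True, True, False, False, False],
    "mail": [True, True, True, False, False],
}

for system, nights in failure_report(history):
    print(f"{system}: {nights} consecutive failed backups")
```

In this sample, `file_server` fails as often as `mail` but never twice in a row, so it stays off the report, while `hr_app` and `mail` are flagged. That is the distinction a nightly success percentage hides: random failures versus the same systems failing every night.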
8. The DR plan isn't tested.
Many companies have a DR plan, and it may have realistic requirements that are technically possible. However, because testing a DR plan is extremely time-consuming and costly, plans usually aren't tested often enough (which is at least twice a year).
Think about how much a data center changes in a year. It's only when you test your DR plan that you find the chinks in its armor. A brand-new application, for example, may not have been included in the plan; your documentation may match the version of the software that you stopped using three months ago; or you upgraded your servers, but the hot site didn't upgrade as well. The only way to find out these things is to test the DR plan.
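Some of this drift can be spotted between tests by comparing what's running in production with what the plan documents. The sketch below is a supplement to testing, not a substitute for it, and it assumes you can export both lists as simple name-to-version mappings; the `plan_drift` function, the application names and the versions are hypothetical.

```python
def plan_drift(inventory, plan):
    """Compare production (inventory) with the DR plan; both map application -> version.

    Returns three sorted lists:
      missing - applications in production that the plan doesn't cover
      stale   - applications whose documented version differs from production
      retired - applications still in the plan but no longer in production
    """
    in_production = set(inventory)
    in_plan = set(plan)
    missing = sorted(in_production - in_plan)
    retired = sorted(in_plan - in_production)
    stale = sorted(app for app in in_production & in_plan if inventory[app] != plan[app])
    return missing, stale, retired


# Made-up example data.
inventory = {"billing": "4.2", "crm": "7.0", "order_portal": "1.0"}
plan = {"billing": "4.2", "crm": "6.5", "fax_gateway": "2.1"}

missing, stale, retired = plan_drift(inventory, plan)
print("not in plan:", missing)
print("documented version out of date:", stale)
print("in plan but no longer running:", retired)
```

A check like this only catches differences that show up in an inventory. Whether the hot site's hardware can actually run the current systems is something only a real test will reveal.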
9. The usual people might not be available.
Companies assume their own personnel will be available for the recovery operation. But think about all the different types of disasters and consider whether this assumption is reasonable. Perhaps your disaster involves significant amounts of water, snow or wind. Any one of those can prevent an employee from reaching the office after the disaster. If a bridge is out or a road is impassable, your key personnel may be prevented from getting to the recovery site. If your DR plan assumes that your personnel will be available, it assumes too much. Your DR plan must make provisions for retaining knowledgeable temporary personnel.
Learn more about disaster recovery
A good place to start when considering a disaster recovery (DR) plan is with one or more of the various directories for DR services. Simply looking at the table of contents on these sites should help you realize how much work goes into constructing a comprehensive DR plan.
Check out the "Disaster Recovery Yellow Pages" at www.disaster-help.com/toc.html, as well as the Edwards Disaster Recovery Directory from Edwards Information LLC at http://disasterrecoveryyp.com.
And don't forget that DRI International (www.drii.org) certifies DR and business-continuity professionals.
10. The documentation isn't up to the job.
If you assume that your personnel won't be available in a true disaster, then your documentation had better be first-rate. In the real world, most documentation isn't. Poorly maintained documentation may be out of date, refer to a software version that hasn't been used in months or years, or cite names of systems and commands that no longer exist.
A long-term employee who has been with the company since the documents were last updated could probably work through not-quite-up-to-date documentation. However, temporary and newer personnel need current documentation.
To avoid this problem, you need to update all of the documentation in your storage environment. You should also use temporary people during your DR tests to ensure that the documentation is comprehensive, accurate and understandable. If people who don't work at your company can figure out what to do in times of disaster, then your documentation is obviously pretty good. If they can't, your documentation needs work.