Published: 10 Dec 2006
Is it really a disaster?
The first step in an effective disaster recovery plan is defining what constitutes a real disaster.
What happens when you say "disaster recovery" in a crowded room? Everyone thinks of something different. That's because the term, like so many others used in the IT world every day, lacks precision. At the very least, we need to clearly define what we mean by disaster recovery.
Thankfully, not all disasters are created equal. I once stepped on the power cord for the storage array I was managing. It popped out of the socket, crashing the array and taking down a data warehouse in the middle of the day. You can bet I called that a disaster! But the mainframe staff didn't even notice.
Because not all disasters are equal in importance, it can be hard to decide just what type you're dealing with when an outage occurs. The first thing a company has to do is decide what constitutes a disaster. In general, if something is localized in scope and time (like my fancy footwork), we call the response an operational recovery. This includes outages of just a few systems. But if the number of affected systems and the timeframe for recovery are sufficiently large, the outage constitutes a real disaster.
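The distinction above can be written down as a simple decision rule. The following Python sketch is only an illustration: the threshold values and the names Outage and classify are hypothetical, and every business would have to choose its own limits. Outages that are large in scope but short in duration, or the reverse, are flagged for a human decision rather than forced into either category.

```python
from dataclasses import dataclass

# Hypothetical thresholds, not taken from the article; set these per business.
MAX_OPERATIONAL_SYSTEMS = 5      # "just a few systems"
MAX_OPERATIONAL_HOURS = 8.0      # "localized in time"


@dataclass
class Outage:
    affected_systems: int
    estimated_recovery_hours: float


def classify(outage: Outage) -> str:
    """Classify an outage by scope (systems affected) and time (hours to recover)."""
    small_scope = outage.affected_systems <= MAX_OPERATIONAL_SYSTEMS
    short_time = outage.estimated_recovery_hours <= MAX_OPERATIONAL_HOURS
    if small_scope and short_time:
        return "operational recovery"
    if not small_scope and not short_time:
        return "disaster"
    # Mixed cases (few systems but a long recovery, or many systems but a
    # short one) are left to whoever has authority to declare a disaster.
    return "judgment call"
```

With these assumed thresholds, a single array that is back within the hour is an operational recovery, while an outage of 200 systems expected to last three days is a disaster.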
It's critical to differentiate between operational recovery and disaster recovery because the tools and techniques used in each situation can differ significantly. Many IT systems are designed with high availability in mind, using technologies like N+1 power supplies, redundant connectivity and clustering to reduce the likelihood of an operational outage. But there are some additional steps you can take to prepare for an operational outage.
One of the most effective is to create frequent disk-based snapshots of the running environment. Many operational outages are related to data loss or corruption, which wouldn't be prevented by a high-availability system architecture. By using disk snapshots, you can quickly revert to a previous version of the system state. Continuous data protection technology is a newer method for dealing with operational outages, allowing for a much more fine-grained selection of the recovery point.
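To see why the spacing of recovery points matters, consider this small Python sketch. It assumes a list of times at which a snapshot (or a continuous data protection recovery point) exists and a known time at which the data went bad. The function names are hypothetical and no particular product's interface is implied.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def latest_recovery_point(
    recovery_points: Sequence[datetime], corruption_time: datetime
) -> Optional[datetime]:
    """Return the newest recovery point taken strictly before the corruption."""
    usable = [point for point in recovery_points if point < corruption_time]
    return max(usable) if usable else None


def work_lost(
    recovery_points: Sequence[datetime], corruption_time: datetime
) -> Optional[timedelta]:
    """Return how much work is lost by reverting, or None if no point is usable."""
    point = latest_recovery_point(recovery_points, corruption_time)
    return None if point is None else corruption_time - point
```

With hourly snapshots and corruption at 20 minutes past the hour, reverting loses 20 minutes of work. Denser recovery points, which is what continuous data protection aims to provide, shrink that gap.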
When enough systems are affected by an outage, it may be time to declare a disaster. We've recently seen a number of full-scale disasters caused by weather, power failures and terrorism. In these cases, the affected companies determined that it would take too long to attempt to recover the systems in place, so they began recovery operations at remote locations. Remoteness provides the greatest challenge for disaster recovery: How can a company ensure that a complete copy of its critical data will be available for recovery given the high cost of wide-area network bandwidth? This is the key technical challenge addressed by most disaster recovery products, from replication to wide-area file services.
There's recovery, and then there's recovery
When problems progress far enough that we declare a disaster rather than just an operational problem, we have to consider what we're recovering. This is what often trips up IT folks: True disaster recovery should include much more than simply data recovery. In fact, true disaster recovery must include resumption of all business activities, including personnel and facilities, as well as technologies like storage, servers, applications and communications.
But in IT infrastructure, we can only concern ourselves with the technical elements of the disaster plan. As long as we communicate with the rest of the business that other areas aren't included, we can focus on doing our part. Even when restricted to applications and system infrastructure, we can quickly see that disaster recovery extends further than the traditional storage realm. Whole applications must be considered, which can include multiple elements on many different systems. And we must understand the requirements of the systems' users, something IT is often ill-prepared to do.
It can be helpful to consider business expectations when thinking about recovery requirements. Operational recovery will often require shorter recovery times and less data loss than recovery from a true disaster. Frequent operational outages must be addressed quickly and data loss contained, or the end users of the services will quickly lose faith in our ability to protect their data. People tend to be more understanding when a large-scale disaster hits a company. Because a site-wide disaster is rarer, it may be acceptable to bring only a subset of applications online or to work with less data than usual.
Another factor is the time required to actually recover data and restart systems. Even with a crate of backup tapes and sufficient bare systems, as would be provided at a contract disaster recovery site, recovery can take a surprisingly long time. Assume it will take up to 10 times as long to restore data from tape as it took to write it to the tape. This may seem excessive, but my own experience in disaster recovery tests has shown that tape-based data recovery is finicky to the extent that it can become nightmarish. Identifying tapes, locating the correct equipment, loading cartridges, indexing data and restoring can lead to numerous restarts, especially when an unfamiliar location is used.
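The 10-times rule of thumb above is easy to turn into planning arithmetic. The Python sketch below is a rough estimate only. The function names are hypothetical, and the default factor of 10 is the upper bound suggested above, not a measured value.

```python
def worst_case_restore_hours(backup_hours: float, slowdown_factor: float = 10.0) -> float:
    """Upper-bound estimate of tape restore time, given how long the backup took to write."""
    if backup_hours < 0 or slowdown_factor < 1:
        raise ValueError("backup_hours must be >= 0 and slowdown_factor must be >= 1")
    return backup_hours * slowdown_factor


def fits_recovery_window(
    backup_hours: float, window_hours: float, slowdown_factor: float = 10.0
) -> bool:
    """Check whether the worst-case restore still fits the time the business allows."""
    return worst_case_restore_hours(backup_hours, slowdown_factor) <= window_hours
```

A backup that took six hours to write should be budgeted at up to 60 hours to restore, so it would not fit a 24-hour recovery window under this assumption.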
When preparing a true disaster recovery plan, we must be pragmatic about the prospects of a disaster happening as well as our real capabilities. If only a few applications can be restored in less than four hours, a few more in less than a day and most others in a week or more, the applications must be clearly classified so that those with the shortest-term requirements can be dealt with first.
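One way to make that classification concrete is a short script that assigns each application to a tier and produces a recovery order. This Python sketch uses the four-hour and one-day boundaries from the paragraph above. The Application class, the function names and any recovery requirements you feed it are hypothetical placeholders for figures agreed with the business.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Application:
    name: str
    required_recovery_hours: float  # agreed with the business, not guessed by IT


def tier(app: Application) -> int:
    """Tier 1: under four hours. Tier 2: under a day. Tier 3: everything else."""
    if app.required_recovery_hours < 4:
        return 1
    if app.required_recovery_hours < 24:
        return 2
    return 3


def recovery_order(apps: List[Application]) -> List[Application]:
    """Applications with the shortest recovery requirement come first."""
    return sorted(apps, key=lambda app: app.required_recovery_hours)
```

Sorting on the requirement itself, not just the tier, keeps the order meaningful within a tier when several applications compete for the same staff and equipment.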
Payroll is a good example of an application that might be somewhat tolerant in the event of a real disaster. Most payroll companies will, on request, duplicate the previous period's payments to ensure that employees get paid despite the IT disaster. Also, many payroll processors keep the data at their end, so a reinstall of the application is all that's needed to update records.
In fact, most companies have very few applications that truly demand recovery in less than a few hours in the event of a site-wide disaster. Point-of-sale, flight operations, fund trading and similar time-sensitive systems will probably need this type of reliability, but many can be protected in more creative ways, such as by using a wide-area database cluster as a recovery solution for critical systems. If these ultra-critical applications can be removed from the disaster recovery solution, there may not be as much to protect as first thought.
You've recovered from a disaster, but now what? How long will you continue to operate out of your disaster site? What kinds of operations will be required there? And how will you get back up and running in your main site? These questions can prove to be especially thorny.
One of the key systems often left out of a disaster recovery plan is the backup server. But the inability to protect data could be a serious omission, especially in regulated businesses. Make sure you have a plan in place to get backups going again after a few days, with appropriate tests to make sure the backups will function properly and contain a useful data set.
You may also find that you begin to outgrow your disaster site fairly quickly, especially when you consider the various concessions made to get things up and running quickly. If you intend to host your VMware images on fewer physical systems, make do without dual Fibre Channel (FC) SAN paths or use SATA disks instead of FC at your disaster site, you're likely to hit a performance wall. Consider how long your business could tolerate limited functionality and performance, and what might happen as demand grows and you need more storage.
The only real solution is to get running normally as quickly as possible. But many disaster recovery products, especially storage replication technologies, lack the ability to "fail back" after a disaster-induced "failover," so check with your vendors to ensure that their products handle failback satisfactorily. And, of course, you must have somewhere to fail back to, whether you return to normal operations at the original location or a new one.
Disaster recovery is complex. It's essential to recognize your real capabilities, but it's also important to extract honesty from the rest of the business. When will we consider a mere failure to have escalated to the level of a real disaster? And what diminished expectations would accompany an honest site-wide disaster as opposed to a simple operational failure? If you answer these questions, your plan will progress more rapidly than it would if you concentrated on technical solutions alone.