In June, the National Hurricane Center announced that the 2015 Atlantic hurricane season promised to be a quiet one. If true, that is good news. The not-so-good news is that a combination of erroneous projections and comparatively quiet storm seasons in 2013 and 2014 appears to have lulled a lot of folks into a laissez-faire attitude when it comes to disaster preparedness. I'm seeing more Alfred E. Neuman (you know, that Mad magazine "What? Me worry?" guy) attitudes than at any time that I can recall -- and it's a little frightening.
For the record, National Oceanic and Atmospheric Administration (NOAA) estimates are just that: estimates, based on past records of climate and weather behavior. Some claim that climate change is taking a toll on older models by changing the level of particle energy trapped in the upper atmosphere. This may limit the relevance of cyclical events -- such as El Niño, changing atmospheric pressures and wind speeds, and so on -- in determining the likelihood of storms. Not that the science of storm prediction has ever been that exact: For example, NOAA predicted an active hurricane season for 2013, a year that generated virtually no severe weather. And some of the worst storms, including Hurricane Andrew in 1992, occurred in seasons when NOAA predicted low activity. Yet many of the companies I visit are pulling back from making any sort of disaster recovery (DR) plans at all.
Some business and IT folks have told me that technology improvements are negating the need for DR planning. In many mainframe shops, folks are telling themselves that running IBM's Virtualization Engine (TS7700) will enable them to jump to a recent RUN (Tape Rewind Unload) point to recover environments automatically following an outage. In other words, they believe it is as simple as rewinding a process to a point just before an interruption occurred and restarting it from there. Of course, reading the product's Redbook provides a very different point of view. You need to consider a lot of factors besides RUN points for a restart, and careful planning and testing are required to make sure that you have everything necessary to restore operations. The oversell by aggressive vendor reps, combined with selective hearing on the part of consumers, is likely to cause folks a lot of grief if and when they need to restore the business.
The real meaning of business continuity
In the x86 world, the same problem seems to be developing. Some hypervisor marketing woo assures users that "DR is dead" and that HA architecture (meaning failover clusters) has eliminated the need for such processes. VMware has actually started referring to its failover cluster configuration as built-in "Business Continuity."
Of course, business continuity actually means something besides failing over workload processing between a pair of clustered servers. According to ISO standards bearing the title "business continuity," the activity aims at recovering not only technology infrastructure and data, but also business processes, personnel and workplace facilities, in the wake of an unplanned interruption event. I pity the fool who believes that failover server clustering means the same thing as business continuity in the ISO sense -- especially if compliance with the ISO standard is being used to satisfy requirements for legal or regulatory compliance.
Let me make the point more directly. The catastrophic loss of data that must be preserved under, say, the Health Insurance Portability and Accountability Act (HIPAA) could very well create a two-fold calamity for a medical firm. First, of course, the operational costs could be huge, as the data loss could jeopardize the healthcare of patients. Second, there might well be additional legal ramifications if the firm has asserted conformance with the ISO 22301 standard on business continuity, and that "conformance" was actually based on a hypervisor vendor branding its server failover model as "instant business continuity."
Why disaster recovery planning matters
Look, I know that nobody wants to do disaster recovery planning. And the big smoke-and-rubble disasters, like hurricanes, have traditionally accounted for only about 5% of downtime in IT. The lion's share of outages results from a mix of planned downtime, software glitches, hardware failures, user errors, malware and viruses. So, some sort of failover (with ongoing data replication) may help firms continue operations through the 95% of issues that result in downtime.
But that isn't business continuity or disaster recovery -- it's disaster prevention. That is an important capability as well, but it is not the same thing. DR and BC planners have to consider insurmountable or unavoidable interruptions that take out the business process. You need to consider how operations will be resumed outside of the perimeter of your facility in case user access to systems and data is cut off, regardless of the cause.
First, you need to build a data infrastructure that is truly available -- and whose availability can be tested and validated on an ad hoc basis. Data replication and prayer are not sufficient. You need to verify mirrors and replicas, to ensure the right data is being replicated continuously and that the replica is being stored at a sufficiently distant off-site location to avoid being affected by any disaster that might jeopardize the original.
This is something that many companies fail to do. Mirroring data between the direct-attached storage behind a pair of clustered servers, or even between a local pair of servers and a remote cluster that is off-premises, is not sufficient. You need to analyze and discover all of the data -- both application data and supporting files (including hypervisor software, drivers and so on) -- that will be needed to re-instantiate workloads at an alternate host. Then you need to determine where the target recovery clusters are physically located, and you need to understand the impact that distance-induced latency and network jitter may have on the currency and validity of the data copy you are making. The latter requires breaking the mirror process and checking the consistency of the local and remote (copied) data sets on a fairly frequent basis.
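To make the consistency check concrete, here is a minimal Python sketch. It assumes the mirror has already been paused and that both copies are reachable as ordinary directory trees; the function names `build_manifest` and `compare_copies` are hypothetical, and real replication products typically supply their own verification tooling. It simply hashes every file on each side and reports what is missing, extra or different.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large data sets do not exhaust memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def compare_copies(primary_root, replica_root):
    """Report files missing from the replica, extra on it, or differing in content."""
    primary = build_manifest(primary_root)
    replica = build_manifest(replica_root)
    return {
        "missing": sorted(set(primary) - set(replica)),
        "extra": sorted(set(replica) - set(primary)),
        "mismatched": sorted(
            name
            for name in set(primary) & set(replica)
            if primary[name] != replica[name]
        ),
    }
```

An empty result in all three lists only says the two copies matched at the moment of the check. It says nothing about whether the right data was selected for replication in the first place, which is why the discovery step still has to come first.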
Keep your head in the clouds
If your backup target is a "cloud," you need to know where, physically, that service is located. A cloud provider may offer top-of-the-line service-level agreements (SLAs) for disaster recovery, but distance will have a lot to do with the ability of the vendor to deliver.
On the one hand, if the disaster recovery as a service (DRaaS) offering is accessed via a metropolitan area network built on SONET or MPLS, ask yourself whether it is sufficiently distant to avoid being wrecked by the same disaster that hits you -- whether that is a city-wide power failure or a hurricane with a 100-kilometer or greater damage radius. Your data won't be safe if the DRaaS provider is in the building across the street.
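As a rough first screen for the distance question, the following Python sketch computes the great-circle distance between two sites with the haversine formula and compares it with a damage radius. The function names and the 100 km default are illustrative assumptions, and straight-line distance ignores shared power grids, flood plains and network paths, so treat a passing result as a starting point rather than an answer.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def outside_damage_radius(primary, recovery, damage_radius_km=100.0):
    """True if the recovery site lies farther from the primary than the damage radius.

    primary and recovery are (latitude, longitude) tuples; the default radius
    is an assumed planning figure, not a meteorological constant.
    """
    return great_circle_km(*primary, *recovery) > damage_radius_km
```

Calling `outside_damage_radius` with the coordinates of your data center and the provider's facility tells you quickly whether the "building across the street" problem applies.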
On the other hand, if the cloud service is quite distant and accessed via a wide area network (WAN), it may or may not be suitable for replicating transactional data that is sensitive to "deltas," or differences, introduced by latency. This goes for simple failover scenarios as well as more complex failover models involving interdependent systems.
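A back-of-the-envelope calculation shows why distance matters for transactional data. The Python sketch below assumes light covers roughly 200 km per millisecond in optical fiber (a common rule of thumb) and counts propagation delay only, so real round trips will be longer once routing, equipment and congestion are added. The function names are hypothetical.

```python
FIBER_KM_PER_MS = 200.0  # assumed rule of thumb: ~200 km per millisecond in fiber


def min_round_trip_ms(route_km):
    """Lower bound on round-trip propagation delay over a fiber route."""
    if route_km <= 0:
        raise ValueError("route_km must be positive")
    return 2.0 * route_km / FIBER_KM_PER_MS


def max_serial_sync_writes_per_sec(route_km):
    """Upper bound on strictly serialized writes per second when each write
    waits for a remote acknowledgment (synchronous replication)."""
    return 1000.0 / min_round_trip_ms(route_km)


def unacknowledged_writes(route_km, writes_per_sec):
    """Rough number of writes in flight, not yet acknowledged by the remote
    site, at any instant under asynchronous replication."""
    return writes_per_sec * min_round_trip_ms(route_km) / 1000.0
```

Under these assumptions, a 1,000 km route has a floor of 10 ms per round trip. A single serialized synchronous stream therefore tops out near 100 writes per second, and an asynchronous stream carrying 5,000 writes per second has about 50 writes in flight at any instant. Those in-flight writes are the "delta" you stand to lose, and only testing against your own workload will show whether that is tolerable.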
In every case, testing is required, so your strategy should include both scheduled and ad hoc testing. Many so-called DRaaS offerings are really "DR as an afterthought" -- hosting service providers who are offering DR as an extra menu option without really understanding what it entails. There are many software vendors right now who are developing "drool-proof" front-end menus for backup or mirroring software, which can be presented to users via a cloud interface. Just making complex data protection software available via a web interface doesn't mean that the vendor knows anything at all about the actual discipline of DR/BC planning or can do a good job of delivering a valid continuity capability.
Bottom line: DR/BC planning remains a challenging task that companies must undertake if they want to survive the interruption events that may represent only 5% of outages, but 100% of financial disasters. You shouldn't deceive yourself about probabilities or frequencies of disaster events. Prepare for the 5% and you will also be able to cope with the other 95%.
BIO: Jon William Toigo is a 30-year IT veteran, CEO and managing principal of Toigo Partners International, and chairman of the Data Management Institute.