High-availability features over disaster recovery? Not so fast

Jon Toigo discusses why, despite their benefits, high-availability features are not a full replacement for a good disaster recovery plan.

Looking at the latest marketing materials from the hypervisor vendors, you’d think high-availability features had become so reliable and powerful that you no longer needed any disaster recovery strategies. Not so fast, I say.

First off, let’s hear the argument put forth by the vendors: Because 95% of interruption events do not involve catastrophic facility failures (fires, for example) or broad geographic disasters (hurricanes, earthquakes, nuclear plant catastrophes), HA clustering is all you need to cope with most interruption events. So, high-availability features trump DR, right? Never mind that HA models must work perfectly, all the time, to deliver as promised. There must be bulletproof processes for failover, beginning with a foolproof "heartbeat" channel that signals the viability (or lack thereof) of the servers included in the cluster relationship.
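To make the heartbeat idea concrete, here is a minimal sketch of timeout-based failure detection. The class name, node names and the three-second timeout are assumptions chosen for illustration, not taken from any vendor's clustering product.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: remembers when each cluster node last checked in."""
    timeout_s: float = 3.0  # assumed value for illustration
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: Optional[float] = None) -> None:
        # Store the time of the most recent heartbeat from this node.
        self.last_seen[node] = time.monotonic() if now is None else now

    def suspected_down(self, now: Optional[float] = None) -> list:
        # A node is suspect once it has been silent longer than the timeout.
        now = time.monotonic() if now is None else now
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=4.0)        # node-a keeps reporting
print(monitor.suspected_down(now=5.0))   # ['node-b']
```

Note what the sketch cannot do: silence from node-b looks the same whether the server died or only the heartbeat link failed. That ambiguity is one reason the channel itself has to be foolproof.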

OK then. Let’s assume that such a relationship can be validated. We then have an additional requirement: Ongoing data mirroring must occur between the storage associated with each cluster member. The mirrored data must be synchronized and share a common state. Finally, no single point of failure can exist in the data storage infrastructure. Sit with that awhile and you’re likely to realize that hanging your high-availability hopes on basic server clustering software may leave you with no actual HA capability at all. I know what you're thinking. And no, going to direct-attached storage behind each server -- even the multi-nodal kind described in most software-defined storage (SDS) or hyper-converged infrastructure models -- does not solve the problem. Such an architecture may solve the problem posed by one single point of failure (a shared storage array cabled to both servers in the cluster), but it introduces others -- in the storage nodes and their interconnects, or in the mirroring process that copies data behind each server, which in most cases is itself controlled by a single instance of the software-defined storage layer.
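One way to picture what "share a common state" means is a block-by-block checksum comparison of the two copies. The sketch below is an assumption-laden illustration: the function names are hypothetical, the volumes are plain byte strings, and both copies are assumed to have been captured at the same quiesced point in time.

```python
import hashlib


def block_digests(data: bytes, block_size: int = 4096) -> list:
    """Return one SHA-256 digest per fixed-size block of the volume image."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def divergent_blocks(primary: bytes, mirror: bytes,
                     block_size: int = 4096) -> list:
    """List the block indexes where the mirror does not match the primary."""
    a = block_digests(primary, block_size)
    b = block_digests(mirror, block_size)
    diverged = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    # Blocks that exist on only one side also count as divergent.
    diverged.extend(range(min(len(a), len(b)), max(len(a), len(b))))
    return diverged


primary = b"A" * 8192 + b"B" * 4096
mirror = b"A" * 8192 + b"X" * 4096       # third block was never replicated
print(divergent_blocks(primary, mirror))  # [2]
```

An empty result says only that the copies matched at the moment of capture. It says nothing about writes still sitting in caches or buffers, which is why checking a live mirror is harder than this sketch suggests.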

Alternatively, attaching each clustered server to one or more SAN switches and creating multiple paths to the back-end storage may resolve the problems associated with shared or isolated direct-attached storage. However, it also multiplies the number of potential failure points and the cost of making key interconnect and storage components redundant. Start to resolve these single points of failure and, pretty soon, you are talking real money.
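The trade-off can be put into rough numbers with the standard availability formulas for components in series and in redundant pairs. The 99.9% figure for each component is an assumed value for illustration only, and the arithmetic assumes independent failures and flawless failover, which is exactly what cannot be taken for granted.

```python
from math import prod

HOURS_PER_YEAR = 8760


def series(*availabilities: float) -> float:
    """Every component must be up: availabilities multiply."""
    return prod(availabilities)


def redundant(*availabilities: float) -> float:
    """Service survives unless every redundant copy is down at once."""
    return 1 - prod(1 - a for a in availabilities)


a = 0.999  # assumed availability of each server, switch and array

# One server, one switch, one array: three single points of failure.
single_path = series(a, a, a)

# Every component duplicated: twice the hardware to buy and maintain.
dual_path = series(redundant(a, a), redundant(a, a), redundant(a, a))

for label, avail in (("single path", single_path), ("dual path", dual_path)):
    downtime_h = (1 - avail) * HOURS_PER_YEAR
    print(f"{label}: availability {avail:.6f}, "
          f"about {downtime_h:.2f} hours of downtime per year")
```

Under these assumptions the single path loses roughly 26 hours a year while the fully duplicated path loses under two minutes, but only by doubling the component count, and only if the failover between copies never misfires.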

I say that applications, and by extension their data, deserve HA only when they are mission-critical.

This isn’t to suggest that high-availability features are without merit. My observations serve only to demonstrate that HA, which has always been part of a spectrum of techniques for shortening “time to data” in disaster recovery, remains one of the most costly and challenging strategies to deploy, configure and maintain over time. It is now, as in the past, a strategy best reserved for apps that require such protection.

When considering the high-availability features of software-defined and hyper-converged storage models proposed for use behind virtual servers, investigate the following questions:

  1. Does the solution enable mirroring of data across non-identical hardware? If the mirror requires identical kit on each end, every change made to one storage kit must also be made on the other cluster node or nodes. That can be costly and time-consuming.
  2. Does the solution enable centralized management of diverse data replication processes? SDS and hyper-converged infrastructures typically deliver storage services and resources to a specific hypervisor’s workload only. If you have multiple hypervisors, and some applications that aren’t virtualized at all, you will likely have multiple mirroring processes that could become as difficult to manage as herding cats. Look for a solution that at least lets you manage replication processes from a common point, even if the storage is isolated behind different hypervisors.
  3. Does the SDS solution offer mirror transparency -- visibility into mirroring processes for ease of testing and validation? Do you have to stop the application generating data, flush caches and buffers to disk, replicate the data to the passive cluster storage, then stop the cluster and mirror to check whether you are replicating the right data? With traditional clustered mirroring, this is the case -- and it is the leading reason why IT administrators fail to test their mirrors. Look for a solution that lets you test your mirror integrity without such disruptive processes.
  4. Does the solution automate the failover of data access between different media, arrays or SANs that are part of the SDS-administered infrastructure so that the re-routing of applications to their data requires no human intervention? If you are deploying storage using a software-defined or hyper-converged model, it is typically direct-attached to a server -- and isolated. Using vMotion or a cut-and-paste technology to move application workload from one machine in a cluster to another should automatically trigger the re-hosted app to seek out local storage for workload data -- without requiring manual intervention to identify the volumes containing the data to the re-hosted application. Is this functionality provided with the solution you are considering?
  5. Does the solution provide the means to fail back from a failover once the error condition that necessitated the failover is resolved? Cluster failover is only partly about failing from a bad node to a good node. You also need to consider how you will fail back when the bad node is repaired or replaced. Is the necessary functionality provided?
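The failover and failback questions can be sketched as a small state model. Everything here is hypothetical (the class, its method names and the node names); the point is only that failback has preconditions that failover does not: the failed node must be repaired and its stale copy of the data resynchronized before the workload returns.

```python
from dataclasses import dataclass


@dataclass
class ClusterPair:
    """Hypothetical two-node cluster with mirrored storage."""
    primary: str
    standby: str
    active: str = ""
    primary_healthy: bool = True
    mirror_in_sync: bool = True

    def __post_init__(self) -> None:
        self.active = self.active or self.primary

    def fail_over(self) -> None:
        # Primary is down: serve from the standby. New writes land only
        # on the standby, so the primary's copy is now stale.
        self.primary_healthy = False
        self.mirror_in_sync = False
        self.active = self.standby

    def repair_primary(self) -> None:
        self.primary_healthy = True

    def resync(self) -> None:
        # Copy the changes made during the outage back to the primary.
        if not self.primary_healthy:
            raise RuntimeError("cannot resync to a node that is still down")
        self.mirror_in_sync = True

    def fail_back(self) -> None:
        if not (self.primary_healthy and self.mirror_in_sync):
            raise RuntimeError("primary must be repaired and resynchronized")
        self.active = self.primary


pair = ClusterPair(primary="node-a", standby="node-b")
pair.fail_over()
pair.repair_primary()
pair.resync()
pair.fail_back()
print(pair.active)  # node-a
```

A solution that handles only `fail_over` in this model leaves the last three steps as manual work, which is what the final question is probing for.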
