Automatic failover for disaster recovery is an option for organizations that need a zero-downtime environment, and there are a number of approaches available in the market today. Andrew Burton, senior editor of SearchDisasterRecovery.com, and Jeff Boles, senior analyst with the Taneja Group, discuss who should consider automatic failover, the vendors that offer products in this space, how the technology works, and the challenges associated with automatic failover.
You can read the text below or download a recording of their conversation.
What is automatic failover and who needs a solution like this?
Automatic failover is the process of automatically moving an application to a standby server during a failure or service event to preserve its uptime. It can be part of a high-availability approach, or part of a disaster recovery (DR) approach, depending on where the second system is, and how it is used.
I want to be clear here: we are talking about "automated" failover, and that excludes many DR approaches that rely on a manual trigger. Generally, users will want a system like this if they are trying to build a downtime-free environment, and such approaches come with some caveats. There's often some complexity involved, and you need serious trust in your architecture, because you are turning over control to an automated process that depends on mechanisms such as heartbeats running in your environment. Today, you choose automatic failover because you have very little downtime tolerance, and there will be some complexity, cost, and risk that go with that choice.
What are the major vendors in this space?
With that broad definition, there's obviously a spectrum of solutions involved here. There are software and hardware approaches that can be used for high availability, even to the point of including those rather unique redundant server designs by NEC and Stratus, but also including HA clustering software from many different vendors. But for the rest of this conversation, let's refine the definition a bit.
Well, since we're talking on SearchDisasterRecovery.com, how about we focus on DR failover. Can you give us a shorter list of DR vendors?
So the list here is still pretty long, and I won't do everybody justice with my quick flyby, but there are a few different buckets with vendors who stick out. On one hand, you have VMware Site Recovery Manager (SRM) for virtual environments. Today, that orchestrates a lot of failover but still requires data synchronization underneath. SRM can basically turn replication from a whole bunch of different vendors into failover if you have a virtual environment. Now with VMware SRM, we're in a grey area, because while it is highly automated once it begins, it still requires an administrator click, and it will incur some significant downtime during the transition as VMs are moved and started up at a remote site. Even so, working alongside things like EMC's RecoverPoint, InMage, or FalconStor's NSS and CDP technology, SRM can be pretty powerful as a DR-oriented failover and failback automation tool.
But SRM still requires some intervention, and if you're looking for truly automated failover and really low downtime, today there are a few other solutions that fall into two buckets. On one hand, there are application-based approaches, such as Microsoft Failover Cluster (which can be used underneath workloads like Exchange and SQL Server), Oracle Data Guard, or others, depending on the application.
On the other hand, you have a bunch of host-agent-type approaches from the likes of CA's XOsoft to Double-Take to Marathon, and the list is a lot longer than that. Basically, these solutions work by synchronizing data between hosts, using a heartbeat from a management server to hosts, and then conducting a failover if one goes away.
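The heartbeat mechanism described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not how any of the named products work: the class name, the host names, and the five-second timeout are all hypothetical, and a real product would also have to deal with data synchronization, split-brain protection, and failback.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each host and picks the active one."""
    timeout_s: float   # silence longer than this counts as a failure (hypothetical value)
    active: str        # host currently serving the application
    standby: str       # host holding the synchronized copy
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, host: str, now: float) -> None:
        """Remember when a host last reported in."""
        self.last_seen[host] = now

    def is_alive(self, host: str, now: float) -> bool:
        """A host is alive if it has reported within the timeout window."""
        seen = self.last_seen.get(host)
        return seen is not None and (now - seen) <= self.timeout_s

    def check(self, now: float) -> str:
        """Swap roles if the active host has gone quiet and the standby is healthy."""
        if not self.is_alive(self.active, now) and self.is_alive(self.standby, now):
            self.active, self.standby = self.standby, self.active
        return self.active


monitor = HeartbeatMonitor(timeout_s=5.0, active="primary", standby="dr-site")
monitor.record_heartbeat("primary", now=0.0)
monitor.record_heartbeat("dr-site", now=0.0)
print(monitor.check(now=3.0))   # primary is still inside the timeout window

monitor.record_heartbeat("dr-site", now=8.0)   # only the standby keeps reporting
print(monitor.check(now=9.0))   # primary has been silent too long, so roles swap
```

Note that the sketch refuses to fail over when the standby is silent as well. That is one simple way to avoid moving an application to a host that may itself be down.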
Are there any other vendors or more recent developments that we should be keeping an eye on?
You also have a bunch of cloud-oriented vendors coming at this space, which aim to replicate and recover an environment in the cloud. I think Doyenz is the reputational leader in this space today, but the enterprise-geared service providers are all chasing this space with an assortment of well-integrated solutions. We also have a bunch of cloud gateway storage vendors chasing it with various technologies, and backup vendors are increasingly shuttling data off to the cloud, where it can be recovered to a DR server. Those solutions aren't anywhere near fully automated yet, but in the next couple of years, these are some of the vendors to watch.
You can see it is a complex space. But for truly automated failover and very low downtime, it basically boils down to application-level technologies like Microsoft Failover Cluster, or agent-based approaches. For bigger sets of technology, you can construct something pseudo-automated, with low downtime, with some of these newer technologies on the market.
There’s obviously a considerable investment in hardware that comes along with this type of solution because you have to set up a site to fail over to. Do you have to have identical hardware at your primary and DR site for this to work?
No, not necessarily, and that's where server virtualization and the cloud both hold tons of potential. When you pair those things together with the ability to replicate data between environments, you could use a much smaller environment at your remote site, and you don't need to worry about keeping the hardware in lockstep. Since virtual servers will run on different hardware, you can envision using older servers or a cloud provider's entirely different infrastructure to host a failover site that you may never use. Meanwhile, even when it comes to highly available stuff, there is some potential for using standby hardware to fulfill other roles until the time of disaster, if you have a tool that can automate shutting down those low-priority workloads. Again, that is where the appeal of VMware's SRM resides.
It seems like any product offering automatic failover faces some serious challenges. How do products solve bandwidth and latency challenges and make sure data is consistent?
For sure, that's the biggest challenge with these tools, and it is a challenge that storage vendors have been hard at work tackling with their replication tools for many years. That's why you see toolsets like SRM leveraging those tools. The latest storage replication tools have some powerful capabilities. Things like EMC's RecoverPoint, FalconStor's CDP, InMage, or IBM's FlashCopy replication can mark points in time and make sure, even across high latencies and low bandwidth, that the remote side of a solution can recognize a known good point in time and recover to that point, even if it means loss of the very latest data.
That ultimately is a trade-off you have to consider—if you don't need to-the-second data and failover, these solutions offer tremendous capabilities. They do this with a variety of powerful data transmission optimizations, including delta-only transmission, compression, deduplication, and snapshot-only-based synchronization.
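To make three of those optimizations concrete, here is a minimal Python sketch of delta-only transmission combined with deduplication and compression. It is an illustration under stated assumptions rather than any vendor's method: the 4 KB block size, the function names, and the message format are all invented for the example, and it does not handle a volume that shrinks.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # illustrative block size, not taken from any product


def split_blocks(data: bytes) -> list[bytes]:
    """Cut a volume image into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def digest(block: bytes) -> str:
    """Content fingerprint used to recognize blocks the remote site already holds."""
    return hashlib.sha256(block).hexdigest()


def build_delta(old: bytes, new: bytes, remote_digests: set[str]) -> list[tuple]:
    """Return the messages needed to bring the remote copy from old to new."""
    old_blocks = split_blocks(old)
    messages = []
    for index, block in enumerate(split_blocks(new)):
        if index < len(old_blocks) and old_blocks[index] == block:
            continue                                   # delta-only: unchanged, send nothing
        key = digest(block)
        if key in remote_digests:
            messages.append((index, "ref", key))       # deduplication: send a reference only
        else:
            messages.append((index, "data", zlib.compress(block)))  # compression
            remote_digests.add(key)
    return messages


def apply_delta(old: bytes, messages: list[tuple], store: dict[str, bytes]) -> bytes:
    """Rebuild the new image at the remote site from its old copy plus the messages."""
    blocks = split_blocks(old)
    for index, kind, payload in messages:
        block = store[payload] if kind == "ref" else zlib.decompress(payload)
        store[digest(block)] = block                   # remember content for later references
        if index < len(blocks):
            blocks[index] = block
        else:
            blocks.append(block)
    return b"".join(blocks)


old = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
new = b"A" * BLOCK_SIZE + b"C" * BLOCK_SIZE + b"D" * BLOCK_SIZE

store = {digest(b): b for b in split_blocks(old)}   # what the remote site already holds
delta = build_delta(old, new, set(store))

print([(index, kind) for index, kind, _ in delta])   # [(1, 'ref'), (2, 'data')]
print(apply_delta(old, delta, store) == new)         # True
```

In this example only one of the three blocks actually crosses the link as data, and that block is compressed. The first block is skipped entirely, and the second is sent as a short reference because the remote site already holds identical content.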
But if you're looking for true automatic failover and minimal noticeable downtime, you're going to be throwing lots of money at high-bandwidth, low-latency, dedicated links that can handle synchronous data transmission. In such cases, there may be optimization technologies that you can throw at it; ask Riverbed what they can do on top of EMC's SRDF, for example. But nonetheless, at the end of the day, it comes down to brute horsepower to get to a no-data-loss, disruption-free style of failover.
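A rough back-of-the-envelope calculation shows why latency matters so much for synchronous transmission. With synchronous replication, a write is not acknowledged until the remote site confirms it, so every write pays at least one round trip. The Python sketch below rests on assumptions that are not from the conversation: light travels roughly 200 km per millisecond in optical fiber, the 0.5 ms local write time is a made-up figure, and real links add switching, protocol, and routing overhead on top.

```python
FIBER_KM_PER_MS = 200.0  # assumption: roughly 200 km per millisecond in optical fiber


def sync_write_latency_ms(distance_km: float, local_write_ms: float) -> float:
    """Local write time plus one round trip to the remote site for the acknowledgement."""
    round_trip_ms = 2 * distance_km / FIBER_KM_PER_MS
    return local_write_ms + round_trip_ms


def max_dependent_writes_per_s(distance_km: float, local_write_ms: float) -> float:
    """Upper bound on writes per second when each write must wait for the previous one."""
    return 1000.0 / sync_write_latency_ms(distance_km, local_write_ms)


for distance_km in (10, 100, 1000):
    latency = sync_write_latency_ms(distance_km, local_write_ms=0.5)
    rate = max_dependent_writes_per_s(distance_km, local_write_ms=0.5)
    print(f"{distance_km:>5} km: {latency:.2f} ms per write, "
          f"about {rate:.0f} dependent writes per second at most")
```

Under these assumptions, distance alone pushes per-write latency from well under a millisecond at metro range to more than ten milliseconds at 1,000 km. That is one reason synchronous replication is usually confined to relatively short distances.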
This was first published in June 2011