Failover and failback operations can be crucial to the success of a disaster recovery (DR) plan. Jeff Boles, Senior Analyst with the Taneja Group, discusses the significance of failover and failback operations in the server virtualization environment to help organizations maximize their DR efforts.
You can read his answers to these frequently asked questions below or download a recording of the Q&A.
Server virtualization has some interesting implications for DR, and I think it has driven a lot of organizations to consider DR more seriously over the past few years. Not long ago, DR meant loading tapes, putting them on a plane and shipping them out to remote sites.
Those were the types of DR steps we took a few years ago, and we've since moved fairly aggressively to replication technologies, whether via storage arrays or host-based agents. We moved first into replicating data to a remote site, via a storage array or a host-based tool, and supporting that data on a standby server there. But when we started doing that, sticky issues remained in managing physical servers across sites.
First, physical machines have lots of hardware dependencies and create many opportunities for hardware dissimilarities across sites. You get a couple of machines out of sync in your DR site -- maybe because you replaced them during the last replacement cycle -- and there's a good chance that your replicated drives won't even come up on these remote machines. Moreover, there's often no clean way to address replicating boot volumes. A few organizations have turned to boot from storage area network (SAN), which allows the use of sophisticated replication technologies. But many organizations are stuck with replicating local server boot volumes with local agents. This is hard to manage when you start distributing thousands of agents across your environment. There's lots of I/O impact, although some solutions offer opportunities for reducing the size of replicated data streams across sites, saving valuable bandwidth.
Now more organizations are looking at server virtualization at primary and remote sites to ease server configuration issues and optimize the use of DR site resources as well. More organizations do this with DR in mind because of TCO-lowering DR management tools like VMware Inc.'s Site Recovery Manager. Using virtualization at both sites allows easier management and testing, and it can increase flexibility and the amount of use you get out of the remote site when you're not using it to support a disaster.
So, when you move into a server virtualization environment at your primary site, server images become much more portable because they're not tied to specific hardware any longer. They're now encapsulated in this server virtualization environment. That server virtualization environment becomes the same at the remote site, regardless of the dissimilarities in underlying hardware.
As long as you have your host environment, your VMware ESX Server or your Microsoft Corp. Hyper-V Server, set up right, that hardware is going to appear the same to the virtual image. I also think part of this is behavioral, because server virtualization encourages us to separate data from virtual server boot volumes and encourages us to SAN-attach our virtualization environment in order to make more use of server virtualization features like moving host images back and forth across hardware.
When virtual images are stored on a SAN, it's easier to move them across hardware to get a load-balanced environment. And when they're on a SAN, it's easier to make better use of data protection and replication tools like snapshots. These can be array-based tools from a storage vendor like IBM Corp., Hewlett-Packard (HP) Co., Compellent, 3PAR Inc., EqualLogic or LeftHand Networks Inc. Or, these tools could even be from a heterogeneous storage virtualization vendor like DataCore Software Corp., IBM or FalconStor Software.
Some of these tools even offer broader-reaching features, including their own functionality such as snapshots and replication, or they can serve as a single point of management. One example is DataCore. That product can even port virtual images from one format to another and allow you to protect your ESX environment with Hyper-V at another location. That might be cost-optimal for you.
That's an extreme example and you're likely to face some compatibility issues, but there are other features to be had on top of these snapshot and replication tools. Regardless, in moving to server virtualization, you can harness all the capabilities of your storage environment, replication tools and snapshot tools. You can also ease the copying of data to your DR site, get closer to the moment before your disaster happens and make sure the data at your remote site has more integrity.
There's lots of third-party stuff out there. VMware Inc. has its Site Recovery Manager, a management framework for detecting failures and bringing up a DR environment on top of VMware. It's a powerful framework for automating, managing and testing.
One solution that I've become particularly fond of because of its capabilities, richness and features is the replication solution from InMage Systems Inc. It has a very lightweight I/O tap agent that goes on a virtual disk and can replicate data to an appliance at a local site. This reduces the amount of data that is sent across a WAN but still captures all of the replicated data at the host level.
So you have very good data integrity but very low I/O overhead going onto a virtual disk. The challenges with agent-based replication solutions have always been the I/O load that the agent places on the virtual disk and the management overhead that it creates in the environment. InMage's solution is very low on I/O impact but still harvests the benefits of agent-based replication. Those benefits include better data integrity, because the agent can understand what is going on at the host.
If you have Exchange on that host, an agent can usually recognize when you have quiesced the host and taken a snapshot. So it knows you have a usable server at that point in time, versus an array-based replication tool that is just replicating data and can't tell you whether that server image is usable at any point in time. With array-based replication, if you corrupt the data, or there is a disaster that corrupts it, that corrupted data may get replicated as well.
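To make the point-in-time idea concrete, here is a minimal sketch in Python of how a recovery tool might pick a restore point. It assumes each recovery point carries a flag saying whether an agent marked it application-consistent; the `RecoveryPoint` type and `latest_usable_point` function are hypothetical names for illustration, not part of InMage's product or any other vendor's API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecoveryPoint:
    """A replicated point in time (hypothetical model, not a vendor API)."""
    timestamp: float      # seconds since some epoch
    app_consistent: bool  # True if a host agent marked the application as quiesced


def latest_usable_point(points: Sequence[RecoveryPoint],
                        disaster_time: float) -> Optional[RecoveryPoint]:
    """Return the newest application-consistent point taken before the disaster.

    Points without the flag are skipped because nothing vouches for the
    server image being usable at that moment.
    """
    candidates = [p for p in points
                  if p.app_consistent and p.timestamp < disaster_time]
    return max(candidates, key=lambda p: p.timestamp, default=None)


# Hypothetical history: only some points were marked consistent by an agent.
history = [
    RecoveryPoint(100.0, True),
    RecoveryPoint(200.0, False),
    RecoveryPoint(300.0, True),
    RecoveryPoint(400.0, False),
]
chosen = latest_usable_point(history, disaster_time=350.0)
```

An array-based tool that cannot see inside the host would, in this model, have no `app_consistent` flag to filter on, which is the gap described above.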
InMage has a very interesting solution for replicating at the host level that is very lightweight and captures points in time where data is good. It consolidates all the streams from multiple guests at the local site and replicates that over a WAN, so you don't get as much data change going on as you do when you're replicating a whole bunch of different streams.
That's one solution, but others exist as well, from vendors like SteelEye Technology Inc. And if you're dealing with sticky issues around moving physical servers to virtual images, there are vendors out there like Racemi Inc., PlateSpin (which has been acquired by Novell), IBM Corp. and VMware Inc. with their P2V solutions. So there are some interesting possibilities for going from a physical environment to a virtual environment at a remote site.
For organizations looking to server virtualization because of DR, there are a couple of things to think about. You're definitely coming from a physical environment if you're just thinking about server virtualization for the purpose of DR. The big question is how heavily you move into virtualization for the purpose of DR. Do you do it in a big way at your primary site as well as at your remote site?
Clearly server virtualization at the remote site offers some tremendous benefits because you can use fewer resources at that remote site and you can likely make better use of replication technologies to move data over there. But at your primary site, there may be more sticky issues to deal with.
You need to think about I/O at your primary site and whether server virtualization is a good fit for every application you have in your environment. These days, you have powerful enough hardware and strong enough options around I/O that, in lots of environments, even heavy I/O apps like Exchange, SQL Server and some of the other things folks have been hesitant to virtualize in the past can be virtualized. That will facilitate doing DR because once those machines are in a virtual image, it's easier to move them.
But if you've got machines that you're absolutely against virtualizing, you're going to have to deal with a few sticky issues, like converting them onto DR servers or else just dealing with a partially virtualized environment at your DR site.
So I see two approaches there. You're going to have full virtualization of your server environment and everything you need to recover at your primary site and your remote site. Or, you're going to have a partially physical environment at your primary site and a fully virtualized environment at your DR site. If so, you're going to have to deal with replicating or putting the physical servers into a virtualized environment at your DR site and there are some sticky issues there.
You need to look at some type of third-party solution that will allow you to grab that physical image and move it into a virtualization environment that's going to have some different drivers, and you're going to have to be more stringent about your testing. If you can virtualize completely at both locations, your DR is going to be a lot easier to manage.
It sounds like you're talking about fully investing in server virtualization and DR at both sites or a mix of physical and virtual between sites. What are the challenges and benefits associated with each approach?
For organizations that are looking to virtualization for DR, moving fully to virtualization for both the primary and recovery sites is certainly going to be more of a challenge because there's going to have to be a lot more rollout at the primary site. There is an attractive opportunity for organizations to look at virtualization for their recovery site and still consider keeping a lot of resources physical at their primary site until something changes in the future.
So in a nutshell, virtualization at the DR site does make sense because there are some benefits to be had there. Virtualization at both sites certainly makes things much easier, but there are benefits even if you just turn to virtualization at the DR site. Those include easier management of the DR site: all the virtualization vendors have tools for managing servers better, for booting servers remotely and for reducing the resources you're going to need at your DR site. So there is a tremendous cost-optimization opportunity there if you can consolidate a bunch of physical servers that might not need the same physical resources underneath them at the DR site. They might not need as much I/O performance or CPU performance in a true disaster; maybe it's just a service that you have to sustain even in a degraded mode. So, you can collapse your 50 or 100 physical servers into virtual servers running on 10 or 20 virtualized hosts at a remote recovery site.
Organizations also need to realize that the cost optimization is more than just consolidation, because server virtualization allows you to easily repurpose those resources too. When you're not using them for DR, you can boot those systems up and use them for test and development.
If you have a DR event, most of the management frameworks will automatically shut those servers down, bring up another set of servers and fail over. If you're using those servers for other purposes, there may be more disruption during a DR event, but that may not be an issue for you considering the cost optimizations.
The challenge with going from physical to virtual though is that there are sticky issues like compatibility between a physical server image and a virtual server image, as far as drivers and that type of thing go. You also have to deal with I/O on your physical servers. You've got to be a little bit sensitive to how much load you're going to put on a server host when you consolidate 20 physical servers on that host in a DR situation.
You've got to observe your performance load in a production environment so you're prepared to right-size your DR environment, but otherwise, it's a great opportunity to use server virtualization for DR.
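As a rough illustration of that right-sizing step, the following Python sketch estimates how many DR hosts a set of observed production loads would need. Everything here is an assumption made for the example: the `ServerLoad` type, the `dr_hosts_needed` function, the headroom and degraded-mode factors, and the sample figures are all hypothetical. It only sums CPU and I/O demand, so it gives a lower bound rather than a placement plan.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ServerLoad:
    """Peak load observed for one production server (hypothetical units)."""
    name: str
    peak_cpu_ghz: float
    peak_iops: float


def dr_hosts_needed(servers: Sequence[ServerLoad],
                    host_cpu_ghz: float,
                    host_iops: float,
                    degraded_factor: float = 1.0,
                    headroom: float = 0.75) -> int:
    """Estimate the minimum number of DR hosts from aggregate demand.

    degraded_factor scales demand down if services may run degraded in a
    disaster; headroom is the fraction of each host you are willing to fill.
    """
    if not servers:
        return 0
    usable_cpu = host_cpu_ghz * headroom
    usable_iops = host_iops * headroom
    total_cpu = sum(s.peak_cpu_ghz for s in servers) * degraded_factor
    total_iops = sum(s.peak_iops for s in servers) * degraded_factor
    # Whichever resource runs out first determines the host count.
    return max(math.ceil(total_cpu / usable_cpu),
               math.ceil(total_iops / usable_iops),
               1)


# Hypothetical example: 20 similar servers consolidated onto larger DR hosts.
production = [ServerLoad(f"srv{i:02d}", peak_cpu_ghz=2.0, peak_iops=500.0)
              for i in range(20)]
full_service = dr_hosts_needed(production, host_cpu_ghz=32.0, host_iops=10000.0)
degraded = dr_hosts_needed(production, host_cpu_ghz=32.0, host_iops=10000.0,
                           degraded_factor=0.5)
```

The estimate is only as good as the production measurements fed into it, which is why observing real load comes first.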
Jeff Boles is a Senior Analyst with the Taneja Group.