What are the major solutions on the market that facilitate DR in a server virtualization environment?
There's a lot of third-party software out there. VMware Inc. has its Site Recovery Manager, a management framework for detecting failures and bringing up a DR environment on top of VMware. It is a powerful framework for automating, managing and testing DR.
One solution that I've become particularly fond of because of its capabilities and rich feature set is the replication solution from InMage Systems Inc. It has a very lightweight I/O tap agent that goes on a virtual disk and can replicate data to an appliance at a local site. This reduces the amount of data that is sent across a WAN but still captures all of the replicated data at the host level.
So you have very good data integrity but very low I/O overhead going onto a virtual disk. One of the challenges with agent-based replication solutions has always been the I/O load that they place on the virtual disk, as well as the management overhead that they create in the environment. InMage's solution is very low on I/O impact but still harvests the benefits of agent-based replication. The main benefit is better data integrity, because the agent can understand what is going on at the host.
If you have Exchange on that host, an agent can usually recognize when you have quiesced the host and taken a snapshot. So it knows you have a usable server at that point in time. An array-based replication tool, by contrast, is just replicating data and can't tell you whether that server image is usable at any point in time. If you corrupt it, or there is a disaster that corrupts it, that corrupted data may get replicated as well.
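To make the point about usable points in time concrete, here is a minimal Python sketch of how a recovery tool might pick a restore point once corruption is discovered. It is only an illustration: the RecoveryPoint record, its app_consistent flag and the latest_usable_point function are hypothetical names, not part of any vendor's product or API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecoveryPoint:
    """Hypothetical record of one replicated point in time."""
    timestamp: int        # seconds since some epoch (assumed unit)
    app_consistent: bool  # True if an agent marked the application as quiesced


def latest_usable_point(
    points: Sequence[RecoveryPoint], corrupted_at: int
) -> Optional[RecoveryPoint]:
    """Return the newest application-consistent point taken before the corruption.

    Returns None when no such point exists.
    """
    candidates = [
        p for p in points if p.app_consistent and p.timestamp < corrupted_at
    ]
    return max(candidates, key=lambda p: p.timestamp, default=None)
```

A replication stream with no consistency markers would have every app_consistent flag set to False, so the function would return None. That is the array-based situation described above: the data is there, but nothing tells you which image is usable.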
InMage has a very interesting solution for replicating at the host level that is very lightweight and captures points in time where data is good. It also consolidates the streams from multiple guests at the local site and replicates them over the WAN together, so you don't send as much changed data as you do when you're replicating a whole bunch of separate streams.
That's one solution, but others exist as well, from vendors like SteelEye Technology Inc. And if you're dealing with sticky issues around moving physical servers to virtual images, there are vendors out there like Racemi Inc., PlateSpin (which has been acquired by Novell), IBM Corp. and VMware Inc. with their P2V solutions. So there are some interesting possibilities for running that hybrid setup, with a physical environment at the primary site and a virtual environment at a remote site.
What are the different approaches and the difference between virtualization on both the production and the recovery side, and virtualization only on the recovery side?
For organizations looking to server virtualization because of DR, there are a couple of things to think about. You're definitely coming from a physical environment if you're just thinking about server virtualization for the purpose of DR. The big questions are how heavily do you move into virtualization for the purpose of DR? Do you do it at your primary site in a big way, too, as well as your remote site?
Clearly server virtualization at the remote site offers some tremendous benefits because you can use fewer resources at that remote site and you can likely make better use of replication technologies to move data over there. But at your primary site, there may be more sticky issues to deal with.
You need to think about I/O at your primary site and whether server virtualization is a good fit for every application you have in your environment. These days, you have powerful enough hardware and strong enough options around I/O that even heavy I/O apps like Exchange, SQL Server and some of the other things folks have been hesitant to virtualize in the past can be virtualized in lots of environments. That will facilitate doing DR because once those machines are in a virtual image, it's easier to move them.
But if you've got machines that you're absolutely against virtualizing, you're going to have to deal with a few sticky issues, like converting them onto DR servers or else just dealing with a partially virtualized environment at your DR site.
So I see two approaches there. You're going to have full virtualization of your server environment and everything you need to recover at your primary site and your remote site. Or, you're going to have a partially physical environment at your primary site and a fully virtualized environment at your DR site. If so, you're going to have to deal with replicating or putting the physical servers into a virtualized environment at your DR site and there are some sticky issues there.
You need to look at some type of third-party solution that will allow you to grab that physical image and move it into a virtualization environment. That environment is going to have some different drivers, so you're going to have to be more stringent about your testing. If you can virtualize completely at both locations, your DR is going to be a lot easier to manage.
It sounds like you're talking about fully investing in server virtualization and DR at both sites or a mix of physical and virtual between sites. What are the challenges and benefits associated with each approach?
For organizations that are looking to virtualization for DR, moving fully to virtualization for both your primary and recovery site is certainly going to be more of a challenge because there's going to be a lot more rollout work at the primary site. There is an attractive opportunity for organizations to look at virtualization for their recovery site and still consider keeping a lot of resources physical at their primary site until something changes in the future.
So in a nutshell, virtualization at the DR site does make sense because there are some benefits to be had there. Virtualization at both sites certainly makes things much easier, but there are benefits even if you just turn to virtualization at the DR site. Those include easier management of the DR site, since all the virtualization vendors have tools for managing servers better and for booting servers remotely, and a reduction in the resources you're going to need at your DR site. So there is a tremendous cost-optimization opportunity there if you can consolidate a bunch of servers that might not need the same physical resources underneath them at the DR site. They might not need as much I/O performance or CPU performance in a true disaster; maybe it's just a service that you have to sustain even in a degraded mode. So, you can collapse your 50 or 100 physical servers into virtual servers running on 10 or 20 virtualized hosts at a remote recovery site.
Organizations also need to realize that the cost optimization is more than just consolidation, because server virtualization allows you to easily repurpose those resources too. When you're not using them for DR, you can boot those systems up and use them for test and development.
If you have a DR event, most of the management frameworks will automatically shut those servers down, bring up another set of servers and fail over. If you're using those servers for other purposes, a DR event will disrupt that work as well, but that may not be an issue for you considering the cost optimizations.
The challenge with going from physical to virtual, though, is that there are sticky issues like compatibility between a physical server image and a virtual server image, as far as drivers and that type of thing go. You also have to deal with the I/O your physical servers generate. You've got to be a little bit sensitive to how much load you're going to put on a server host when you consolidate 20 physical servers on that host in a DR situation.
You've got to observe your performance load in a production environment so you're prepared to right-size your DR environment, but otherwise, it's a great opportunity to use server virtualization for DR.
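As a rough illustration of that right-sizing step, the Python sketch below estimates how many DR hosts a set of observed production loads would need. Everything in it is an assumption made for the example: the Load record, the hosts_needed function, the degraded_factor that scales demand down for degraded-mode service, and the headroom reserved on each host. It sums demand and divides by usable capacity, so it gives a lower bound and does not account for how individual servers actually pack onto hosts.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Load:
    """Hypothetical peak load of one server, or capacity of one host."""
    cpu_ghz: float
    iops: float


def hosts_needed(
    observed: Sequence[Load],
    host_capacity: Load,
    degraded_factor: float = 1.0,
    headroom: float = 0.2,
) -> int:
    """Estimate the DR host count from observed production peaks.

    degraded_factor: fraction of production load to sustain in a disaster (0-1].
    headroom: fraction of each host's capacity kept in reserve [0-1).
    """
    if not 0 < degraded_factor <= 1:
        raise ValueError("degraded_factor must be in (0, 1]")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")

    usable_cpu = host_capacity.cpu_ghz * (1 - headroom)
    usable_iops = host_capacity.iops * (1 - headroom)
    total_cpu = sum(s.cpu_ghz for s in observed) * degraded_factor
    total_iops = sum(s.iops for s in observed) * degraded_factor

    # Size for whichever resource is the bottleneck, CPU or I/O.
    return max(
        math.ceil(total_cpu / usable_cpu),
        math.ceil(total_iops / usable_iops),
    )
```

Feeding this with peaks measured in production, rather than with the nameplate specs of the physical servers, is the point: servers that rarely use their hardware are the ones that consolidate well, and the I/O term catches the case where CPU looks fine but the host's storage path would be overloaded.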
Jeff Boles is a Senior Analyst with the Taneja Group.