Failover and failback operations can be crucial to the success of a disaster recovery (DR) plan. So how confident are you in your organization's failover processes? Jeff Boles, Senior Analyst with the Taneja Group, discusses the significance of failover and failback to a DR plan and provides best practices for ensuring the effectiveness of these operations.
Table of contents:
>>The importance of failover and failback for DR
>>Defining failover and failback
>>Running a system when failover and failback are delayed
>>Best practices for selecting a recovery site
>>Failover operations for non-critical business operations
>>The location of the backup facility
>>Major players in the failover/failback operations space
Failover and failback are the secret sauce in executing a DR plan. Assuming that your DR infrastructure is set up appropriately (the systems you need are being replicated and protected appropriately at the secondary location), failover and failback will in fact be the most disruptive element of your DR execution.
The disruptiveness of this process is usually defined by the amount of data change you have going on at the primary location, your available bandwidth, and how your data is being copied, mirrored or replicated to that secondary location.
If you're an architect, you should be interested in minimizing data transmission while maximizing synchronization between sites. Then focus on how to trigger failover while minimizing the time the operation takes. There are a number of technologies in this area; their capabilities vary, and that can change how well you can execute.
There are some continuous data protection (CDP)-type replication technologies out there, such as InMage, that excel in these areas, minimizing data transmissions and maximizing synchronization between sites.
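The interplay of data change rate and bandwidth described above lends itself to a back-of-the-envelope feasibility check. The sketch below is purely illustrative; the function name, utilization factor and figures are all assumptions, not taken from any vendor tool:

```python
# Hypothetical check: can the WAN link between sites keep up with the
# data change rate at the primary? All numbers are illustrative only.

def replication_feasible(change_rate_mb_per_hr: float,
                         link_mbps: float,
                         utilization: float = 0.7) -> bool:
    """Return True if the link can sustain the primary site's churn.

    change_rate_mb_per_hr: average data change at the primary (MB/hour)
    link_mbps: raw link speed in megabits per second
    utilization: fraction of the link realistically usable for replication
    """
    # Convert link capacity to MB/hour: Mbps -> MB/s -> MB/hour
    usable_mb_per_hr = link_mbps * utilization / 8 * 3600
    return change_rate_mb_per_hr <= usable_mb_per_hr

# Example: a 100 Mbps link at 70% utilization moves 31,500 MB/hour,
# so 50 GB/hour of churn would fall behind, while 20 GB/hour keeps up.
print(replication_feasible(50_000, 100))  # False
print(replication_feasible(20_000, 100))  # True
```

If the check fails, you either need more bandwidth, a technology that reduces what must be sent (compression, deduplication, CDP-style change journaling), or a looser recovery point objective.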
Failover is the process of shifting I/O and application processing from a primary location to a secondary DR location. This typically involves a vendor's tool or a third-party tool of some type that can temporarily halt I/O, suspending any data copying and mirroring activity from the primary location to the secondary location, and then bring applications and I/O up at the remote location.
During activity at the remote site, changes are usually tracked so that the original location can be re-synchronized and restored to service by replicating just the data that changed between the start and end of the DR event back to the primary location when it comes back up. Failback is the process of re-synchronizing that data back to the primary location, halting I/O and application activity once again, and cutting back over to the original location.
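The failover/failback cycle just described can be sketched as a toy state machine. Everything here is hypothetical (the class, its methods, and the change-tracking scheme); in practice a vendor replication tool performs these steps, but the sequence of halt, cut over, track changes, and resync the delta is the same:

```python
# Minimal illustrative model of the failover -> track changes -> failback flow.

class DRController:
    def __init__(self):
        self.active = "primary"
        self.changes_at_dr = []  # changes tracked while running at the DR site

    def failover(self):
        # 1. Halt I/O at the primary and suspend copy/mirror activity.
        # 2. Bring applications and I/O up at the secondary site.
        self.active = "secondary"
        self.changes_at_dr.clear()

    def write(self, change):
        # While failed over, record every change so failback only has to
        # replicate the delta between the start and end of the DR event.
        if self.active == "secondary":
            self.changes_at_dr.append(change)

    def failback(self):
        # 1. Re-synchronize the tracked delta back to the primary.
        delta = list(self.changes_at_dr)
        # 2. Halt I/O once more and cut back over to the original location.
        self.active = "primary"
        self.changes_at_dr.clear()
        return delta  # the changes that had to be shipped back

ctrl = DRController()
ctrl.failover()
ctrl.write("order-1042")
ctrl.write("order-1043")
print(ctrl.failback())  # ['order-1042', 'order-1043']
print(ctrl.active)      # primary
```

The key point the model makes visible is that failback cost is proportional to the delta accumulated during the DR event, which is why long failovers make failback progressively more disruptive.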
Running a system while failover or failback is delayed is really a matter of the operations, availability and performance level of the DR center, and the ability of your staff to support ongoing operations at a remote facility. None of those aspects should be overlooked. Yet however much time you spend on the failover operation itself, the implications of running in that failed-over state are often overlooked.
Organizations should consider their DR capabilities in the context of the impact on staff and the long-term availability of the DR center. An extended outage or a far-removed facility may be impossible for you to support with your existing staff. Or, frankly, your staff just may not be available if you have a severe local disaster.
When it comes to selecting a recovery site, there are lots of facilities out there today. More and more of them are converging on an even playing field and delivering very similar services. They are becoming commoditized, and choosing between sites is increasingly just a cost comparison for similar levels of capability. Keep in mind your requirements for running through an extended outage, and your personnel's ability to manage the failover facility remotely.
Beyond this, also consider what service-level agreements (SLAs) you currently have that need to be preserved while operating in a failover state and whether a backup facility can meet those SLAs. Quite often when we're assessing a DR facility, we get myopic and think specifically of disaster-type SLAs.
We often neglect the operational SLAs we have in place internally and don't carry those over to the DR infrastructure when they may be crucial to continuing business as normal. It doesn't do you any good to maintain uptime and operations if your performance may still deteriorate to the point that you lose customers or sacrifice revenue.
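One way to make that operational-SLA check concrete is a simple comparison of the SLA levels you must preserve against a candidate DR site's stated capabilities. The metric names and figures below are assumptions invented for this sketch:

```python
# Illustrative check: which internal, operational SLAs would a candidate
# DR facility fail to meet while you are running in a failover state?

def sla_gaps(operational_slas: dict, dr_capabilities: dict) -> list:
    """Return the SLA metrics the DR site cannot meet.

    Both dicts map a metric name to a number where higher is better
    (e.g. percent uptime, transactions per second).
    """
    return [metric for metric, required in operational_slas.items()
            if dr_capabilities.get(metric, 0) < required]

slas = {"uptime_pct": 99.9, "txn_per_sec": 500}
dr_site = {"uptime_pct": 99.95, "txn_per_sec": 300}
print(sla_gaps(slas, dr_site))  # ['txn_per_sec']
```

In this invented example the facility keeps the lights on but cannot sustain transaction throughput, which is exactly the "uptime without performance" trap described above.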
When deciding whether failover is worthwhile for business operations normally considered less critical, I would weigh four factors:
- The relative cost of outages for systems in my enterprise
- The cost of protecting those systems
- The potential likelihood of a disaster
- The potential outage period from a disaster
I would use those criteria to evaluate systems supporting key business processes, all of the infrastructure systems that support key business processes, and then place them on a quadrant-like grid that would visually present the cost of outage and the cost of protection for each primary application in the enterprise.
By projecting the cost of an outage and considering those costs in the context of disaster risk, you may be able to better establish a threshold with senior management for what should be protected in your enterprise. That is how I would begin assessing whether failover is warranted for day-to-day business systems that enterprises normally consider less critical.
After that exercise is done, I would repeat it for general operational IT systems, like email and general infrastructure services. Then I would add one more element: the number of systems and business processes that depend on those services.
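The quadrant exercise above can be sketched numerically. The threshold, costs, likelihoods and quadrant labels here are all invented for illustration; the point is only the shape of the calculation, risk-adjusting outage cost and then plotting it against protection cost:

```python
# Illustrative sketch of the outage-cost vs. protection-cost quadrant grid.

def expected_outage_cost(cost_per_day: float,
                         outage_days: float,
                         annual_likelihood: float) -> float:
    """Risk-adjusted annual cost of losing this system in a disaster."""
    return cost_per_day * outage_days * annual_likelihood

def quadrant(outage_cost: float, protection_cost: float,
             threshold: float = 100_000) -> str:
    """Place a system in one of four quadrants for the management discussion."""
    high_outage = outage_cost >= threshold
    high_protect = protection_cost >= threshold
    if high_outage and not high_protect:
        return "protect first"             # big exposure, cheap to cover
    if high_outage and high_protect:
        return "negotiate budget"          # big exposure, expensive to cover
    if not high_outage and high_protect:
        return "accept risk"               # small exposure, expensive to cover
    return "protect opportunistically"     # small exposure, cheap to cover

# Example: an order system losing $200k/day, with a 5-day expected outage
# and a 20% annual disaster likelihood, carries $200k of risk-adjusted cost.
risk = expected_outage_cost(200_000, 5, 0.2)
print(quadrant(risk, protection_cost=40_000))  # protect first
```

Presenting systems on such a grid gives senior management a visual way to agree on the protection threshold rather than debating each system in isolation.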
When it comes to the location of the backup facility, there are in my view two aspects users should consider: Is your disaster recovery center isolated from a local disaster that may impact your primary location, and is it readily accessible in the event of an extended outage? If not, your team needs to be comfortable with remote management, and you should be sure that all remote operations are supported with lights-out technology.
It's becoming increasingly practical to construct DR solutions in a cloud as well. This requires a larger shift in management for many organizations, but it can harness more cost-effective resources and be more flexible and scalable. More than likely this will require a fairly comprehensive shift to server virtualization.
The issue surrounding this approach is that services may be more difficult to guarantee with service-level agreements (SLAs). These services may be more subject to performance degradation from disasters impacting large numbers of customers. It may be very challenging trying to get a handle on what a cloud-based service provider's infrastructure is capable of; there may be unique security implications, and you may run into various compliance and regulatory hurdles depending on your industry when you start looking at hosting data in the cloud.
DR in the cloud is certainly viable. It may resolve many of your concerns about data center location, because it forces you down the path of managing things in a hands-off way that can be run remotely by almost any personnel.
All of the major vendors have solutions, and various mechanisms exist at the operating system level and from backup vendors as well. I would go to your major solution provider partners and talk to them about their recommended approaches.
One other important aspect of failover and failback to consider is the assessment of whether your services are appropriately configured for synchronization and complete startup of a recovery environment. In general, the IT practitioner has relied heavily on the manual testing of failover and failback. But that exercise is surrounded with poor assumptions.
You're assuming that your preparation for the exercise will help you mitigate the issues you would face during an unpredictable disaster. But discussions with end users suggest these exercises are often riddled with holes, failures or compromises. As a result, we're seeing an increasing number of vendors bring solutions to the table that manage or evaluate the DR environment more holistically, including Simple Continuous and Continuity Software. These tools focus on managing your DR environment and its setup, preserving it over time, and identifying what might be out of spec so that you can correct issues, or automate their mitigation, when they pop up.
Those solutions, in my view, are disruptive to the marketplace this year. They will have tremendous traction with end users and solve some big issues in large enterprises. But the value proposition extends all the way down the food chain to small businesses that have a fairly complex environment to manage DR for.
Jeff Boles is a Senior Analyst with the Taneja Group.