It's no secret that building effective disaster recovery (DR) capabilities is peppered with challenges, from identifying business requirements and mapping them to technology solutions that don't break the budget to coping with the operational impact of disaster recovery planning and testing.
At the same time, surveys highlight that the increased impact of outages and business demand for improved uptime place more pressure on IT to improve the state of disaster recovery planning within their organizations. On the bright side, technology options for DR are increasing and becoming more cost-effective. The widespread adoption of server virtualization, an increased variety of options for networked storage and lower bandwidth costs are encouraging more organizations to renew their efforts to improve their DR practices. However, when trying to improve disaster recovery, there are steps that newcomers may miss and even old-timers may not adequately consider. So I've provided some disaster recovery best practices to help you avoid DR interdependency predicaments.
Prioritize disaster recovery applications
One of the primary challenges with disaster recovery is dealing with its inherent interdependencies. The coordination required between the various functional areas of IT requires an end-to-end perspective -- disaster recovery is only as effective as its weakest link. It does no good to replicate storage or recover data if servers and applications are not available, and nothing gets done without a functioning network and the availability of necessary functions like name and directory services (e.g., DNS, LDAP, Active Directory).
It is helpful to consider this aspect of interdependency in the context of disaster recovery best practices. DR planning processes typically include a prioritization of applications based on business criticality, leading to the establishment of formally defined recovery tiers, with tier 1 representing the most mission-critical applications. However, there is a set of applications, or more accurately, services, including the aforementioned networking and name services, that must be operational prior to any recovery of tier 1 functions. This tier, which we can refer to as tier 0, also includes other prerequisite services: communication applications such as voice mail, IP telephony and BlackBerry Enterprise Server; recovery components such as data backup servers; security; and network monitoring and management tools. Surprisingly, this last area is often overlooked in the planning process because while monitoring tools are critical to IT, they aren't typically thought of as "business critical" in the same way as ERP or order entry. I recently encountered an environment using 27 different monitoring applications in its network operations center (NOC), but no DR plan existed for them because they weren't on the tier 1 list.
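One way to make tier 0 visible is to write the prerequisites down as a dependency map and let a topological sort produce the recovery order. The following is a minimal sketch, not a description of any particular DR tool; the service names and their dependencies are hypothetical and chosen only to mirror the examples above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: each service maps to the services that must
# already be running before it can be recovered.
DEPENDS_ON = {
    "network": set(),
    "dns": {"network"},
    "active_directory": {"network", "dns"},
    "backup_server": {"network", "dns"},
    "monitoring": {"network", "dns"},
    "erp": {"active_directory", "backup_server", "monitoring"},
    "order_entry": {"active_directory", "erp"},
}


def recovery_order(depends_on):
    """Return services in an order where prerequisites always come first."""
    return list(TopologicalSorter(depends_on).static_order())


order = recovery_order(DEPENDS_ON)
print(order)
```

Anything that sorts ahead of the first tier 1 application is, in effect, tier 0, whether or not it appears on the business-criticality list. A dependency cycle raises `graphlib.CycleError`, which is itself a useful planning finding.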
Don't overlook RTOs and RPOs
When selecting disaster recovery methods, the first criteria to consider are recovery time objectives (RTOs) and recovery point objectives (RPOs). These metrics largely determine the acceptable methods of recovery.
For short timeframe RTOs or RPOs, data replication is far and away the predominant method of choice. This brings us to another frequently overlooked aspect of interdependency that can directly impact the ability of organizations to meet their RTO and RPO targets: application data interdependency. Many have discovered, often at the worst possible time, that although they technically have all of their data replicated to a DR location, it is not readily usable by applications because it is not in sync. As a result, database administrators and application specialists need to spend additional hours, sometimes days, reconciling data and rolling databases back to bring the various data components into alignment. By the time this effort is complete, the desired recovery window has long since been exceeded.
Keep up with data replication
Why does this problem occur and how can it be avoided? If all data replication activities were synchronous, this would not be much of an issue. However, most replication is asynchronous, meaning that interdependent replicated data sets may very well be in different states of consistency at the remote location. With enterprise storage replication, the problem of application data consistency is well understood and can be addressed by using consistency groups.
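To see why independently replicated data sets can miss an RPO even when each one looks healthy, consider the simplified sketch below. It assumes that each replicated data set reports the time of the last write applied at the remote site, and that interdependent data can only be aligned at a point no later than the least advanced copy. The data set names, timestamps and the five-minute RPO are all hypothetical.

```python
from datetime import datetime, timedelta


def replication_skew(last_applied):
    """Return (common_point, skew) for a set of replicated data sets.

    common_point is the latest time at which every copy has data;
    skew is how far the most advanced copy is ahead of it.
    """
    common_point = min(last_applied.values())
    skew = max(last_applied.values()) - common_point
    return common_point, skew


# Hypothetical state of three interdependent data sets at the DR site.
last_applied = {
    "orders_db": datetime(2009, 12, 1, 10, 0, 0),
    "inventory_db": datetime(2009, 12, 1, 9, 58, 30),
    "billing_db": datetime(2009, 12, 1, 9, 45, 0),
}
failure_time = datetime(2009, 12, 1, 10, 0, 5)
rpo = timedelta(minutes=5)  # assumed target

common_point, skew = replication_skew(last_applied)
data_loss = failure_time - common_point
rpo_met = data_loss <= rpo
print(common_point, skew, data_loss, rpo_met)
```

In this example the most current copy is only five seconds behind the failure, yet the usable recovery point is more than 15 minutes old because the other copies must be rolled back to match the slowest one.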
Enterprise replication technologies such as EMC Corp.'s SRDF and Hitachi Data Systems' TrueCopy support the notion of consistency groups, where the storage LUNs of group members are replicated in timestamp order to ensure consistency with one another. Not all storage replication technologies support consistency groups, and most host-based options are not designed around this notion. Simple, single-server applications may not require this functionality, but for any organization seeking to replicate complex or highly interdependent applications, it is essential for quick recovery. Therefore, it's important to fully understand your data replication requirements -- particularly application data dependencies -- before selecting a data replication approach.
On the DR planning front, there can often be application interdependency challenges across recovery tiers. At face value, a given application may be classified as tier 2 in terms of business criticality, but if a tier 1 application is in any way dependent upon it, then it must be reclassified as tier 1 from a recovery perspective.
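The reclassification rule can be applied mechanically once application dependencies are recorded. The sketch below is one illustrative way to do it: it repeatedly promotes any application to the tier of the most critical application that depends on it. The application names, tiers and dependencies are hypothetical.

```python
def effective_tiers(business_tier, depends_on):
    """Promote each application to the tier of its most critical dependent.

    business_tier: app -> tier number (1 is most critical).
    depends_on: app -> set of apps it needs in order to function.
    """
    tiers = dict(business_tier)
    changed = True
    while changed:  # repeat so promotions flow through chains of dependencies
        changed = False
        for app, deps in depends_on.items():
            for dep in deps:
                if tiers[dep] > tiers[app]:
                    tiers[dep] = tiers[app]
                    changed = True
    return tiers


# Hypothetical classification by business criticality alone.
business_tier = {"order_entry": 1, "pricing": 2, "tax_tables": 3, "reporting": 3}
depends_on = {
    "order_entry": {"pricing"},
    "pricing": {"tax_tables"},
    "reporting": set(),
}

recovery_tier = effective_tiers(business_tier, depends_on)
print(recovery_tier)
```

Here pricing and tax_tables both end up in tier 1 for recovery purposes, even though neither was rated that way on business criticality, while reporting stays in tier 3.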
But it doesn't end there. For environments that are already employing technologies that support consistency groups, it is critical to ensure that consistency group management is well-integrated into change management processes. For example, if a LUN is added to expand a replicated database, it must be added to the associated consistency group. Failure to do so will result in an inconsistent copy at the remote location and added hours of recovery effort.
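A check of this kind is easy to automate as part of a change management process. The sketch below simply compares the LUNs a database uses against the members of its consistency group; how those two lists are obtained depends on the storage array and is not shown. The LUN identifiers are hypothetical.

```python
def missing_from_group(database_luns, consistency_group):
    """Return LUNs used by the database that are not in its consistency group."""
    return sorted(set(database_luns) - set(consistency_group))


# Hypothetical state after a LUN was added to expand the database.
database_luns = ["lun-101", "lun-102", "lun-103"]
consistency_group = ["lun-101", "lun-102"]

unprotected = missing_from_group(database_luns, consistency_group)
print(unprotected)
```

A non-empty result means the remote copy can no longer be assumed consistent, so the change should not be considered complete until the list is empty.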
Effective disaster recovery requires diligence and attention to detail. The interdependency problems described here are easily avoidable through a regular and robust DR testing program. Unfortunately, DR testing remains an area of weakness for many organizations, so these examples may simply serve to underscore the risk. The old adage "an ounce of prevention is worth a terabyte of cure" definitely still applies.
James Damoulakis is CTO of GlassHouse Technologies, the leading independent IT infrastructure services provider. He can be reached at firstname.lastname@example.org.
This was first published in December 2009