Why you should have a disaster recovery testing plan in place
Disaster recovery has become tougher due to ever-changing virtual environments. Ensure your disaster recovery plan testing runs smoothly with the help of DR monitoring tools.
Being able to recover from a disaster is consistently a top priority for IT managers. They're constantly looking for ways to protect more applications, and to do it more economically with less downtime. But even with sustained investment, there's still an alarming lack of confidence in how well these processes will perform when a real disaster event occurs.
One of the most ambitious projects an IT department will ever embark on is the creation of a disaster recovery (DR) plan. But IT professionals need to understand that creating the plan is only the first step in the process. No matter how carefully crafted it is, a DR plan has no value if it doesn't work when needed or if only a subset of the protected data can be recovered and recreated. In addition to developing an adequate DR plan, IT must implement a strictly enforced change control process so that changes in the environment are reflected in the plan. Yet the reality of the modern data center is that change typically happens too fast for a change control process to keep up with it. Even if change control is followed most of the time, one small misstep or slip-up can result in recovery failure.
Four disaster recovery monitoring must-haves
1. Environment awareness. Disaster recovery (DR) monitoring tools must go beyond application awareness and understand the environment so that changes to the application's specific environment are detected and reported.
2. Hardware and software independence. DR monitoring tools should work across a variety of applications and storage hardware to analyze for inconsistencies.
3. Monitoring only. DR monitoring tools don't have to actually move data -- there are numerous hardware and software vendor products that do that. DR monitors should therefore complement those solutions, not compete with them.
4. Work from a knowledgebase. DR monitoring tools shouldn't depend only on the information they collect from devices. They should develop their own list of best practices and use it to check for DR gaps.
The proof is in the testing
Disaster recovery plan testing is critical to identifying changes in the environment so that the plan can be updated or modified to include any new situations and to accommodate any altered conditions. Despite the importance of DR plan testing, full-scale tests can only be done periodically because they're time-consuming and often expensive to conduct. In practice, partial tests are more common, and they run quarterly at best; many businesses do a full-scale test only once a year.
Many businesses also have the added burden of multiple disaster recovery locations, typically driven by legal or compliance regulations that often require geographic distance between source and recovery sites. That means each DR data center should conduct its own standalone DR test, and having several sites can make the gaps between the various DR sites and the primary site even greater.
The problem is that in between DR tests, many configuration changes take place in a typical data center, often happening very rapidly. As a result, IT planners are looking for ways to monitor and validate their disaster readiness in between full-scale tests. DR monitoring tools are able to audit processes such as clustering and replication to ensure these systems capture all the data they need and store the redundant data copies correctly.
Configuration drift is the root of the problem
When a disaster recovery process like replication is first implemented, it's installed into a known, static application state. The volumes have all been created and configured, and they can be easily identified by the replication application so that it can protect them. But as the application evolves, new volumes may be added so that more host servers can be supported. Or perhaps a volume gets moved to a different storage system so that performance can be improved, such as moving log files to an all-flash array. These additions or changes are often not reported to the IT personnel in charge of the disaster recovery process and, consequently, are left out of the protection process. This condition is referred to as configuration drift: the DR infrastructure falls out of sync with the production environment. It's such a common problem that industry analyst firm Enterprise Strategy Group states that six out of 10 recovery operations fail due to configuration drift.
The configuration changes will typically be discovered during the next DR test and can be corrected then. But if a disaster occurs before the next scheduled test, data loss is likely to occur, along with a failure to return the application to proper operation. Ideally, every time a configuration change is made to an application, a DR test would be run to make sure all the changes have been mapped into the DR process. In the real world, however, most IT budgets can't support the expense of such frequent DR tests, and the IT staff is stretched far too thin to execute tests so frequently.
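To make the idea concrete, here is a minimal Python sketch of a drift check. It is not how any particular product works; the `Volume` model, the `find_drift` function and the volume and array names are all hypothetical, and the sketch assumes the tool already has two inventories in hand: the volumes the application uses in production and the volumes the replication process protects.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    """A storage volume used by one application (hypothetical model)."""
    name: str
    array: str  # storage system that currently hosts the volume


def find_drift(production: set[Volume], protected: set[Volume]) -> dict[str, list[str]]:
    """Compare what production uses with what replication protects."""
    protected_by_name = {v.name: v for v in protected}
    # Volumes added in production that replication has never heard of.
    unprotected = sorted(
        v.name for v in production if v.name not in protected_by_name
    )
    # Volumes that replication knows about, but on a different storage system.
    moved = sorted(
        v.name
        for v in production
        if v.name in protected_by_name
        and protected_by_name[v.name].array != v.array
    )
    return {"unprotected": unprotected, "moved": moved}


# Illustrative inventories (made-up names).
production = {
    Volume("data01", "array-a"),
    Volume("logs01", "flash-b"),  # log files moved to an all-flash array
    Volume("data02", "array-a"),  # new volume added to support more hosts
}
protected = {
    Volume("data01", "array-a"),
    Volume("logs01", "array-a"),  # replication still points at the old array
}

drift = find_drift(production, protected)
print(drift)  # {'unprotected': ['data02'], 'moved': ['logs01']}
```

Both findings correspond to the two scenarios described above: a volume added after replication was set up, and a volume relocated for performance without the DR team being told.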
The issue of configuration drift is even more common in today's highly virtualized data center. Thanks to abstraction at the host, network and storage levels, change is very easy to implement and can easily go unnoticed.
Storage systems track themselves
Storage vendors may include configuration change tracking with the monitoring software they sell alongside their hardware. But that tracking is too narrow: it typically reports only on the vendor's own hardware, not the mixed environments common in enterprises. As a result, it will likely fail to detect changes that could jeopardize an effective recovery.
DR monitoring apps can cut testing time
DR monitoring tools such as Continuity Software's RecoverGuard 4.0 and Symantec's CommandCentral Disaster Recovery Advisor can help close the gap that configuration drift opens in DR readiness. These tools are software applications that conduct daily scans of the production environment to look for coverage gaps and other areas of potential exposure. The key is that they're proactive watchdogs of the DR process, which is far more effective than relying on quarterly or annual DR tests to check whether change management has kept up.
These tools can monitor DR at the application level and understand when new volumes have been added to that application or when an application has had data moved to another volume. When the monitoring app detects one of these situations, it can alert storage administrators via an email or even by opening a trouble ticket in their help desk software.
The tools have recently added support for virtual infrastructures as well. As noted earlier, the virtual layer makes it more difficult to recognize errant configurations because the abstraction hides the physical hardware from administrators. It also makes changes to the environment -- like the addition of new servers -- harder to detect, since there's no physical server being installed that a DR manager might notice. For example, many databases require data files and log files to sit on separate disk volumes; a virtualized environment may conceal a violation of that requirement from a manual check.
In addition to confirming that the right data is replicated to the remote location, DR monitoring tools can validate the equipment configurations at those locations to ensure they're compatible with the application they may have to support. DR monitoring tools can also detect how far out of sync the disaster recovery site is, measuring how well the recovery site duplicates the primary site's data. The IT team can set thresholds that trigger an alert if, for example, the secondary site's data copy falls too far behind the primary site's data, so that IT is notified of the problem.
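A threshold check of this kind can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the 30-minute threshold and the timestamps are invented, and a real tool would obtain the last-write and last-sync times from the replication software rather than hard-coding them.

```python
from datetime import datetime, timedelta


def replication_lag(primary_last_write: datetime,
                    secondary_last_sync: datetime) -> timedelta:
    """How far the recovery copy trails the primary (never negative)."""
    return max(primary_last_write - secondary_last_sync, timedelta(0))


def exceeds_threshold(lag: timedelta, threshold: timedelta) -> bool:
    """True when the recovery site is further behind than IT allows."""
    return lag > threshold


# Illustrative timestamps (made up for the example).
primary = datetime(2024, 1, 1, 12, 0)     # latest write at the primary site
secondary = datetime(2024, 1, 1, 11, 15)  # latest data applied at the DR site

lag = replication_lag(primary, secondary)
alert = exceeds_threshold(lag, threshold=timedelta(minutes=30))
print(lag, alert)  # 0:45:00 True
```

In this example the recovery copy is 45 minutes behind, which exceeds the assumed 30-minute limit, so the check would raise an alert.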
Knowledge is power
The key to DR monitoring tools is their built-in knowledgebase. Instead of scanning hardware and hoping it reports errors accurately, these applications are based on their own proprietary databases of best practices and configurations. The tools use their database of information to verify various configurations for storage, networking and WAN segments. The capabilities of these knowledgebases are expanding to the point that their trouble-ticketing capability now includes root cause analysis. In other words, these products can not only alert IT staff to a potential problem, but can also provide specific advice on how best to remedy the situation.
Companies like Continuity are leveraging their knowledgebase expertise to expand beyond DR monitoring into full SAN management. Where most products in the past have focused on performance diagnostics, Continuity focuses on risk detection. This step into the SAN also complements DR monitoring; after all, if the SAN itself is misconfigured it will likely impact the disaster recovery plan downstream.
One of the downsides to a knowledgebase-driven product is that it has to support the specific applications and physical hardware in your environment. It's important to ensure that the DR monitoring app you're considering supports the crucial components you run. This means choices may be more limited when upgrading to future hardware platforms since you'll want the DR monitoring software to provide equal support. Finally, if your organization likes emerging technology and wants to be on the cutting edge, then a knowledgebase-driven DR monitoring tool may not be the right fit, as it's unlikely the tool will support new products until they have an established customer base that justifies inclusion in the knowledgebase.
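The vendors' knowledgebases are proprietary, so the following Python sketch is only a guess at the general shape of the idea: a list of best-practice rules, each of which inspects a scanned configuration and returns a finding with remediation advice. The rule names, the configuration keys and the volume identifiers are all hypothetical.

```python
from typing import NamedTuple, Optional


class Finding(NamedTuple):
    rule: str
    advice: str


def check_log_separation(cfg: dict) -> Optional[Finding]:
    """Best practice: database data and log files sit on different volumes."""
    if cfg["data_volume"] == cfg["log_volume"]:
        return Finding(
            "separate-data-and-logs",
            f"Move the log files off {cfg['data_volume']} to their own volume.",
        )
    return None


def check_replication_coverage(cfg: dict) -> Optional[Finding]:
    """Best practice: every volume the application uses is replicated."""
    used = {cfg["data_volume"], cfg["log_volume"]}
    missing = sorted(used - set(cfg["replicated"]))
    if missing:
        return Finding(
            "replicate-all-volumes",
            f"Add {', '.join(missing)} to the replication set.",
        )
    return None


# The "knowledgebase" in this sketch is simply the list of rules to apply.
KNOWLEDGEBASE = [check_log_separation, check_replication_coverage]


def audit(cfg: dict) -> list:
    """Run every rule and keep only the ones that report a problem."""
    findings = []
    for rule in KNOWLEDGEBASE:
        finding = rule(cfg)
        if finding is not None:
            findings.append(finding)
    return findings


# Hypothetical scan result: the physical volume behind each file type.
scanned = {"data_volume": "lun-7", "log_volume": "lun-7", "replicated": ["lun-7"]}
findings = audit(scanned)
for f in findings:
    print(f.rule, "->", f.advice)
```

Keeping the rules separate from the scan is what lets such a tool report advice along with the alert: each rule carries its own remedy, and new best practices can be added without changing how the environment is scanned.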
Bottom line on DR inspection tools
DR planning is never a one-time event; it's a constant process that has to keep up with evolving service-level agreements and changes in the environment. Given the realities of a rapidly changing data center, it's almost impossible for change control processes to keep up, and it's equally difficult to conduct DR tests with enough frequency to be meaningful. As a result, most companies, especially large enterprises, should consider disaster recovery monitoring tools that allow for the near-real-time analysis of the DR setup and processes. This includes the primary data center SAN and its reciprocal data center that will be counted on in the case of a disaster.
About the author:
George Crump is president of Storage Switzerland, an IT analyst firm focused on storage and virtualization.