There is no single uniform approach to disaster recovery planning. Each organization must establish plans and implement recovery strategies leveraging tools and technologies that are appropriate for its particular business model, recovery requirements and compliance obligations. Regardless of the specific approach, however, disaster recovery planning is not a one-time academic exercise. In actual practice, disaster recovery plans often necessitate changes in the storage infrastructure and impose other overhead tasks that must be addressed.
Disaster recovery plans must also be tested and updated periodically so that they remain relevant as the business grows or the supporting IT infrastructure changes. In essence, disaster recovery management must become a process much like change management: it must be part of daily operations, because companies must operate in a state of disaster avoidance and preparedness.
Below are four of the more pressing disaster recovery management issues.
Infrastructure and overhead
Disaster recovery strategies typically involve changes to an existing storage or network infrastructure. Ultimately, a storage administrator must budget and schedule the possible hardware, software, implementation, training and facilities costs needed to accommodate the disaster recovery strategy. Just a few years ago, a hardware addition may have been as simple as adding a tape drive to an existing tape library, but now it may require more substantial additions like dedicated storage systems. One example might be the acquisition of a NearStor virtual tape library (VTL) from NetApp or a deduplication storage array from Data Domain.
In most cases, and based on best practices, backups intended for disaster recovery purposes are sent to a remote location. Services like those from Iron Mountain can transport physical tapes to a secure offsite vault, but an increasing number of organizations are adopting disk-based backups and are practicing remote replication between storage systems at two or more locations. For example, a bank may use a WAN link to replicate data from one EMC Centera in its main data center to a secondary Centera located in a backup data center across the state.
Disaster recovery strategies usually involve one or more software applications, such as backup, snapshot, mirroring or replication tools. One example is EMC's TimeFinder software, which creates local copies of data volumes referred to as Business Continuance Volumes (BCVs). This storage array-specific technology is often used in conjunction with SRDF software, which is designed to replicate Symmetrix DMX volumes to a remote location. NetApp's Snapshot, SnapMirror and SnapVault are other good examples of software products that can be combined as part of a disaster recovery strategy. There are also hardware-independent replication solutions that allow end users to replicate between dissimilar storage arrays, such as Symantec's Replication Exec.
Whether software is bundled with the storage system or acquired separately, IT staff must invest the time to become proficient with each tool. Smart managers will ensure that key IT personnel are given that time.
Once the disaster recovery infrastructure is in place, it can take a significant amount of time to establish the initial backup or replica. This may require an evening or weekend to make full backup tapes or synchronize data between replication sites across a WAN. After the initial replication, an IT department must allocate the time to handle incremental tape backups or nightly replication.
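To get a rough sense of how long an initial synchronization across a WAN might take, the arithmetic can be sketched as below. The function name, the 70% link efficiency and the sample sizes are all assumptions made for illustration; real throughput depends on latency, protocol overhead and competing traffic.

```python
def transfer_hours(data_tb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move data_tb terabytes over a WAN link.

    data_tb    -- data to copy, in decimal terabytes (10**12 bytes)
    link_mbps  -- nominal link speed in megabits per second
    efficiency -- assumed fraction of the link actually usable (hypothetical default)
    """
    if data_tb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("sizes and rates must be positive, efficiency in (0, 1]")
    total_bits = data_tb * 1e12 * 8                 # terabytes to bits
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return total_bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical scenario: seed 10 TB over two different link speeds.
    for link in (100, 1000):
        hours = transfer_hours(10, link)
        print(f"10 TB over {link} Mbit/s: about {hours:.0f} hours")
```

Under these assumed figures, 10 TB over a 100 Mbit/s link works out to roughly 13 days, far more than a single weekend, so the estimate is worth running before the replication schedule is committed.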
Security
Organizations rely on backups to protect themselves from a disaster, but are the backups themselves vulnerable to a disaster? Whenever corporate data resides outside the direct control of an IT department, it is important to consider the implications of data security. The selection of any remote location should start with an evaluation of physical security.
Tape storage or remote data center equipment should always be kept under lock and key, accessible only to a minimum number of authorized personnel. Fire extinguishers and suppression systems should use gases that are friendly to electronic equipment and digital media (water-based systems should be avoided). The geographic location should not be subject to flooding, earthquakes and other natural disasters. Man-made disasters, such as sabotage or acts of terrorism, must also be considered depending on the nature of the business. Inspect a remote facility in advance whenever possible. If the facility is managed by another company, such as Iron Mountain, take the time to discuss the company's security and disaster plans, and define its liability regarding your vital data.
The data itself may need to be secured through encryption techniques. As a rule, only personally identifiable information must be secured (such as customer records with Social Security or credit card numbers), though organizations that replicate data often choose to encrypt all data in order to maintain security across an open WAN. Encryption can be handled through backup software or implemented through dedicated encryption appliances integrated into the network, such as Decru's DataFort.
However, before opting for data encryption, you should always measure the impact of this decision on other technologies you may have chosen to implement as part of a disaster recovery strategy. For example, data deduplication solutions lose most, if not all, of their data reduction capabilities when handling encrypted data.
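The interaction between encryption and deduplication can be illustrated with a small sketch. Everything here is an assumption made for illustration: the fixed 4,096-byte chunk size, the function names, and especially `toy_encrypt`, which is a hash-based counter keystream standing in for a real cipher and is not secure. It only mimics the property that matters, namely that identical plaintext at different positions produces different ciphertext.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed-size chunking; real products may chunk differently


def dedup_ratio(data: bytes, chunk_size: int = CHUNK_SIZE) -> float:
    """Return total chunks divided by unique chunks (1.0 means no savings)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        return 1.0
    unique = {hashlib.sha256(chunk).digest() for chunk in chunks}
    return len(chunks) / len(unique)


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Illustration only, NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        keystream = hashlib.sha256(key + counter).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], keystream))
    return bytes(out)


if __name__ == "__main__":
    block = bytes(range(256)) * 16   # one 4,096-byte block of sample data
    backup = block * 100             # highly redundant hypothetical backup stream
    encrypted = toy_encrypt(backup, b"demo-key")
    print(f"plaintext dedup ratio: {dedup_ratio(backup):.1f}")
    print(f"encrypted dedup ratio: {dedup_ratio(encrypted):.1f}")
```

In this contrived case the redundant plaintext collapses to a single unique chunk, while the encrypted copy offers the deduplication step nothing to match. Measuring a sample of real backup data in the same way is one way to gauge the impact before committing to a design.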
Testing and training
Even the most elaborate disaster recovery plan is useless if it cannot be executed. An important part of disaster recovery management is periodic testing and training, bringing new IT personnel up to speed on the disaster recovery process and verifying that recovery is achievable within the specified recovery time objective (RTO).
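One way to make "achievable within the specified RTO" concrete is to record how long each recovery took during a drill and compare it with the target. The sketch below is a minimal illustration; the `DrillResult` structure, the system names and the timings are hypothetical, not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrillResult:
    """Outcome of recovering one system during a disaster recovery drill."""
    system: str            # hypothetical system name
    rto: timedelta         # recovery time objective agreed with the business
    measured: timedelta    # recovery time actually observed in the drill

    @property
    def met(self) -> bool:
        return self.measured <= self.rto


def failed_drills(results):
    """Return the drill results whose measured recovery time exceeded the RTO."""
    return [r for r in results if not r.met]


if __name__ == "__main__":
    results = [
        DrillResult("mail", timedelta(hours=4), timedelta(hours=3, minutes=10)),
        DrillResult("billing-db", timedelta(hours=2), timedelta(hours=5)),
    ]
    for r in failed_drills(results):
        print(f"{r.system}: took {r.measured}, RTO is {r.rto}")
```

Keeping such results from drill to drill also shows whether recovery times are drifting as data volumes grow.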
Disaster recovery exercises can be very disruptive as they usually require that a portion of the production environment be brought offline to truly test the recovery procedures and supporting technology. Proper planning and special care must be taken to avoid creating a disaster while trying to test your preparedness to handle one.
Some organizations avoid lost production time and the risk of unexpected problems by leveraging an existing test or development environment. This provides an opportunity to test recoverability using the same means employed in the production network. While this tactic does not verify the actual recoverability of the production equipment, it does provide much needed practice for IT personnel. Drills often include discussion time for personnel to evaluate the plan and make any recommendations to streamline or improve the disaster recovery process.
There are no solid guidelines that dictate how often a disaster recovery plan should be tested, though once a year is probably the minimum frequency. In addition to regularly scheduled testing, additional testing can be accommodated as needed when personnel turnover occurs or when changes to the IT infrastructure require modifications to the disaster recovery plan. If your organization has a contract with a disaster recovery service provider, the contractual agreement typically includes testing time. This provides an opportunity to test disaster recovery in a nondisruptive fashion, away from the production environment. However, you usually need to schedule testing time well in advance. Organizations should also take into account that allocating IT resources to disaster recovery testing might take them away from their usual duties. Once again, proper planning is essential to avoid creating unnecessary exposures to the production environment.
Updating the plan
Finally, disaster recovery plans are never static. Changes invariably occur with storage resources, applications, IT personnel and even business processes or the corporate entity itself, as with mergers or acquisitions. As changes take place, the disaster recovery plan must be updated to reflect those changes. For example, if 2 TB of storage capacity is added or a new storage array is installed, that additional storage must be included in the disaster recovery cycle. As another example, new privacy legislation may require files to be encrypted where they may not have been encrypted in the past.
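A simple way to catch storage that was added but never brought into the disaster recovery cycle is to compare the current inventory against the list of protected volumes. The sketch below assumes both lists can be exported as plain volume names; the function and the volume names are hypothetical.

```python
def unprotected_volumes(inventory, protected):
    """Return the volumes present in the inventory but missing from the DR plan."""
    return sorted(set(inventory) - set(protected))


if __name__ == "__main__":
    # Hypothetical exports from a storage inventory and a backup/replication job list.
    inventory = ["vol_finance", "vol_hr", "vol_new_array_01"]
    protected = ["vol_finance", "vol_hr"]
    for name in unprotected_volumes(inventory, protected):
        print(f"not covered by the disaster recovery plan: {name}")
```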
Changes can also have an adverse effect on the disaster recovery strategy. Consider the same 2 TB of additional storage capacity in the previous example. Since more storage will likely result in longer backup times, it may be necessary to consider a different backup technology or increase the WAN bandwidth to maintain acceptable RTOs and recovery point objectives (RPOs) for replicated data. Whatever the case, an organization's change management process must include disaster recovery planning to ensure that any IT change that could potentially affect recoverability is captured and addressed before it is implemented. Change management should also ensure that disaster recovery is included in the early development stage of any new application and supporting infrastructure.
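The effect of added capacity on a backup window can be estimated with simple arithmetic, sketched below. The 6 TB starting capacity, the 200 MB/s sustained throughput and the 10-hour window are assumed figures chosen only to illustrate the 2 TB example; the function names are hypothetical.

```python
def backup_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Hours needed for a full backup at a sustained throughput (decimal units)."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return capacity_tb * 1e6 / throughput_mb_s / 3600   # TB -> MB, seconds -> hours


def required_throughput_mb_s(capacity_tb: float, window_hours: float) -> float:
    """Sustained MB/s needed to finish a full backup inside the window."""
    if window_hours <= 0:
        raise ValueError("window must be positive")
    return capacity_tb * 1e6 / (window_hours * 3600)


if __name__ == "__main__":
    window = 10.0        # assumed backup window in hours
    throughput = 200.0   # assumed sustained throughput in MB/s
    for capacity in (6.0, 8.0):   # before and after adding 2 TB
        hours = backup_hours(capacity, throughput)
        status = "fits" if hours <= window else "exceeds"
        print(f"{capacity:.0f} TB: {hours:.1f} h, {status} the {window:.0f} h window")
    print(f"needed for 8 TB: {required_throughput_mb_s(8.0, window):.0f} MB/s")
```

With these assumed numbers the extra 2 TB pushes a full backup from about 8.3 hours to about 11.1 hours, past the 10-hour window, which is the kind of finding a change management review should surface before the storage goes into production.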
Pierre Dorion contributed to this disaster recovery management article. Dorion is the data center practice director and a senior consultant with Long View Systems Inc.