Disaster recovery (DR) is an area of security planning that aims to protect an organization from the effects of significant negative events. Having a disaster recovery strategy in place enables an organization to maintain or quickly resume mission-critical functions following a disruption.
A disruptive event can be anything that puts an organization's operations at risk, from cyberattacks, power outages and equipment failures to natural disasters. The goal of DR is for a business to continue operating as close to normal as possible. The disaster recovery process includes planning and testing and might involve a separate physical backup site for restoring operations. An emergency communication plan is another part of a disaster recovery strategy, enabling an organization to contact staff and relevant emergency response personnel and keep them updated.
Modern disaster recovery has numerous options, including more budget-friendly routes for smaller organizations that are wary of investing funds into planning for a theoretical disaster. Although disaster recovery planning is centered around a disruptive event that hasn't yet occurred, it's often better to be safe than sorry and have a strategy in place. Otherwise, the organization risks being caught without a plan at a critical moment.
Elements of a disaster recovery plan
Disaster recovery plans differ by organization and industry, with different compliance requirements and expectations. However, there is a general outline that plans should follow.
According to independent consultant Paul Kirvan, the components needed in a DR plan include:
- A disaster recovery policy statement, plan overview and main goals of the plan.
- Key personnel and DR team contact information.
- Description of disaster response actions immediately following an incident.
- A diagram of the entire network and recovery site.
- Directions for how to reach the recovery site.
- A list of software and systems that admins will use in the recovery.
- Sample templates for a variety of technology recoveries, including technical documentation from vendors.
- Tips for dealing with the media.
- Summary of insurance coverage.
- Proposed actions for dealing with financial and legal issues.
- Ready-to-use forms to help complete the plan.
According to Kirvan, the development team should include the following activities when creating the DR plan:
- Meet with the internal technology team and network admin to establish the scope of the plan, and then brief senior management on the meeting.
- Gather all relevant network infrastructure documents.
- Identify the most serious threats and vulnerabilities to the infrastructure.
- Review the previous history of outages and disruptions and how the business handled them.
- Identify the most critical IT assets and determine their maximum outage time.
- Identify the disaster response team and its capabilities.
- Have management review the plan.
- Test the plan and update it if necessary.
- Schedule the next review/audit of disaster recovery capabilities.
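One of the activities above, identifying critical IT assets and their maximum outage time, lends itself to a small illustration. The following is a minimal sketch in Python, not a prescribed method: the `Asset` class, the `recovery_order` helper and the sample assets are all hypothetical, and it assumes that the asset with the shortest tolerable outage should be recovered first.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Asset:
    """A hypothetical IT asset and the longest outage the business can tolerate."""
    name: str
    max_outage: timedelta


def recovery_order(assets):
    """Return assets sorted so the shortest tolerable outage comes first."""
    return sorted(assets, key=lambda asset: asset.max_outage)


# Sample data is invented for illustration only.
inventory = [
    Asset("payroll database", timedelta(hours=8)),
    Asset("order-processing API", timedelta(minutes=30)),
    Asset("internal wiki", timedelta(hours=24)),
]

print([asset.name for asset in recovery_order(inventory)])
# ['order-processing API', 'payroll database', 'internal wiki']
```

In practice, the ordering would also account for dependencies between systems, which this sketch ignores.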
An organization should consider its disaster recovery plan a living document. The DR plan needs scheduled reviews and updates to ensure it's accurate and will work if a recovery is required. The plan should also be updated whenever there are changes in the business that could affect disaster recovery.
Difference between disaster recovery and business continuity
Business continuity and disaster recovery (BC/DR) often go hand in hand, but despite overlap, they are not the same. Both business continuity and disaster recovery play key roles in a data protection strategy and come with their own requirements and strategies.
Although disaster recovery focuses on an organization getting back on its feet after a disruption or failure, business continuity planning is centered on keeping things running while a disaster occurs. There are many motivations for having an effective business continuity plan, from compliance requirements -- if the organization is responsible for data that must be readily available -- to protecting the reputation of the business. Although something unavoidable such as a natural disaster might mean expected downtime, the less downtime the better. If an organization is unlucky enough to undergo a cyberattack, it will already be dealing with some trust being lost, so downtime is even less acceptable.
Both disaster recovery and business continuity involve planning not only for technical troubles, but physical issues as well. If a data center undergoes a disruptive event, BC/DR plans should include potential remote working locations and procedures for staff, both during an event and after it takes place, if the primary location needs time for repairs.
All organizations should be concerned with disaster recovery, but a business continuity plan should be considered a high priority as well.
The importance of disaster recovery: RPO and RTO
A disaster can have a devastating effect on a business. Studies have shown that many businesses fail after experiencing a significant data loss, but DR can help.
Recovery point objective (RPO) is the maximum age of files that an organization must recover from backup storage for normal operations to resume after a disaster. The RPO determines the minimum frequency of backups. For example, if an organization has an RPO of four hours, the system must back up at least every four hours.
Recovery time objective (RTO) is the maximum amount of time, following a disaster, for an organization to recover files from off-site and local backup storage and resume normal operations. In other words, the RTO is the maximum amount of downtime an organization can handle. If an organization has an RTO of two hours, file recovery can't take any longer than that.
RPO and RTO help administrators choose optimal disaster recovery strategies, technologies and procedures.
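The two targets can be expressed as simple comparisons. The sketch below is illustrative only; the function names `meets_rpo` and `meets_rto` are assumptions, and the four-hour and two-hour figures are the examples used above.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Backups must run at least as often as the RPO requires."""
    return backup_interval <= rpo


def meets_rto(recovery_time: timedelta, rto: timedelta) -> bool:
    """Recovery must finish within the downtime the RTO allows."""
    return recovery_time <= rto


rpo = timedelta(hours=4)
rto = timedelta(hours=2)

print(meets_rpo(timedelta(hours=4), rpo))     # True: backing up every four hours
print(meets_rpo(timedelta(hours=6), rpo))     # False: backups are too infrequent
print(meets_rto(timedelta(minutes=90), rto))  # True: recovery fits in two hours
```

The recovery time passed to `meets_rto` would normally come from a measured DR test rather than an estimate.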
Meeting tighter RTO windows requires positioning secondary data so that admins can access it faster. Recovery-in-place is one method of restoring data more quickly. This technology moves backup data to a live state on the backup appliance, eliminating the need to move data across a network. It can protect against storage system and server failure. Before using recovery-in-place, an organization must consider the performance of the disk backup appliance, the time needed to move data from a backup state to a live state and failback. Because recovery-in-place can take up to 15 minutes, an organization might need to perform replication if it wants a quicker recovery time.
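As a rough illustration of that trade-off, the following Python sketch picks a method from the RTO alone. The `suggest_method` helper is hypothetical, the 15-minute ceiling is the figure quoted above, and a real decision would also weigh appliance performance and failback, which this sketch leaves out.

```python
from datetime import timedelta

# Worst-case time for recovery-in-place, taken from the figure quoted above.
RECOVERY_IN_PLACE_WORST_CASE = timedelta(minutes=15)


def suggest_method(rto: timedelta) -> str:
    """Suggest replication when the RTO is tighter than recovery-in-place can guarantee."""
    if rto < RECOVERY_IN_PLACE_WORST_CASE:
        return "replication"
    return "recovery-in-place"


print(suggest_method(timedelta(minutes=5)))  # replication
print(suggest_method(timedelta(hours=1)))    # recovery-in-place
```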
Preparing for a disaster requires a comprehensive approach that encompasses hardware and software, networking equipment, power, connectivity and testing that ensures DR is achievable within RTO and RPO targets. Although implementing a thorough DR plan isn't a small task, the potential benefits are significant.
Disaster recovery planning and strategy
A disaster recovery plan provides a structured approach for responding to unplanned incidents that threaten an organization's IT infrastructure, including hardware and software, networks, procedures and people.
The plan provides step-by-step disaster recovery strategies for recovering disrupted systems and networks to minimize negative effects on operations. A risk assessment identifies potential threats to the IT infrastructure; the DR plan outlines how to recover the elements that are most important to the organization.
Disaster recovery testing
Testing is critical to change management in DR planning, helping to identify gaps and providing a chance to rehearse actions in the event of a crisis. A DR plan has a lot of moving parts, so testing it can help the organization understand exactly what employees should be doing during disaster recovery scenarios.
An organization should have a schedule for testing its disaster recovery policy and be mindful of how intrusive that testing is. Testing too frequently can drain personnel, but organizations that test on a less frequent schedule often delay testing further. In addition, an organization should test its DR plan after any system changes.
One way of testing is to run in disaster mode for a period; for example, failing over to the recovery site, letting the systems run there for a week and then failing back.
Ways to get the most out of disaster recovery testing include:
- Secure management approval and funding for the test.
- Provide detailed information about the test.
- Make sure the entire test team is available on the planned test date.
- Ensure your test doesn't conflict with other scheduled tests or activities.
- Confirm test scripts are correct.
- Verify that the test environment is ready.
- Schedule a dry run of the test.
- Be ready to halt the test if needed.
- Have a scribe take notes.
- Complete an after-action report about what worked and what failed.
- Use the results from the test to update the DR plan.
Although it's optimal to perform a comprehensive disaster recovery test, this might not always be possible because of a lack of funding, time or resources. In that case, an organization should still bring together the key participants, distribute all the relevant documents and perform a walk-through of the test. There are risks to this scaled-down DR testing approach, as technology that hasn't been thoroughly tested might not work properly when needed.
Cloud disaster recovery/disaster recovery as a service
Disaster recovery as a service (DRaaS) is a cloud-based DR method that has gained popularity in recent years.
Positives of DRaaS include lower cost, easier deployment and the ability to test plans regularly. Cloud storage services save an organization money by running on a shared infrastructure. They are more flexible, as organizations can sign up for just the services they need. DR tests can be completed by simply spinning up temporary instances.
But cloud-based disaster recovery might not be available after a large-scale disaster, as there might not be enough room at the DR site to run every application. Cloud DR also increases bandwidth needs and could degrade network performance with more complex systems. Costs vary widely among vendors -- some charge based on network bandwidth consumption or storage consumption -- and can add up quickly.
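Because charges tied to consumption can add up quickly, a back-of-the-envelope estimate is useful before comparing vendors. The Python sketch below is an assumption-laden illustration: the `estimate_monthly_cost` function and every rate and volume in it are invented, and real pricing models differ by provider. A vendor that bills on only one dimension can be modeled by setting the other rate to zero.

```python
def estimate_monthly_cost(
    storage_gb: float,
    transfer_gb: float,
    storage_rate_per_gb: float,
    bandwidth_rate_per_gb: float,
) -> float:
    """Estimate a monthly bill from storage held and data transferred."""
    return storage_gb * storage_rate_per_gb + transfer_gb * bandwidth_rate_per_gb


# Hypothetical volumes and rates, for illustration only.
cost = estimate_monthly_cost(
    storage_gb=5000,
    transfer_gb=800,
    storage_rate_per_gb=0.02,
    bandwidth_rate_per_gb=0.09,
)
print(f"{cost:.2f}")  # 172.00
```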
Before choosing a provider, an organization should conduct an internal assessment to determine its disaster recovery needs. Questions to ask a potential DRaaS provider include:
- Will DRaaS work based on the existing infrastructure? How will the product integrate with existing backup and DR platforms?
- What percentage of the provider's customers can be supported simultaneously during a regional disaster?
- What happens if the provider can't supply a disaster recovery service?
- How will users access internal applications?
- How long can a customer run in the provider's data center after a disaster? What are the failback procedures?
- How much help can be expected from the provider during a disaster?
- What is the process for testing?
- Does the product offer scalability?
- Exactly how does the provider charge for its disaster recovery service?
In most cloud recovery situations, an organization should plan on failing workloads back to the original location as soon as the crisis is resolved. However, some DRaaS providers don't support automated failback.
Disaster recovery sites: Hot, warm and cold
At a DR site, an organization can recover and restore its technology infrastructure and operations when its primary data center is unavailable. DR sites can be internal or external.
An organization sets up and maintains an internal disaster recovery site. Organizations with large information requirements and aggressive RTOs are more likely to use an internal DR site, which is typically a second data center. Among the considerations in building an internal site are hardware configuration, supporting equipment, power maintenance, heating and cooling of the site, layout design, location and staff. An organization might want to perform a risk assessment of the recovery site as if it is the primary data center.
The internal site option is often much more expensive than an external site, but a major advantage is control over all aspects of the disaster recovery process.
An outside provider owns and operates an external disaster recovery site. External sites can be hot, warm or cold.
- Hot site: A fully functional data center with hardware and software, personnel and customer data, typically staffed around the clock; operationally ready in the event of a disaster.
- Warm site: An equipped data center that doesn't have customer data; an organization can install additional equipment and introduce customer data following a disaster.
- Cold site: Has infrastructure to support IT systems and data, but no technology until an organization activates DR plans and installs equipment; sometimes used to supplement hot and warm sites during a long-term disaster.
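The difference between the three site types comes down to what is already in place when a disaster strikes. The following Python sketch models that under simplifying assumptions: the `SiteType` class, its two boolean fields and the `steps_before_operational` helper are hypothetical names, and staffing and connectivity are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteType:
    """Simplified view of what an external DR site already has in place."""
    name: str
    has_equipment: bool
    has_customer_data: bool


HOT = SiteType("hot", has_equipment=True, has_customer_data=True)
WARM = SiteType("warm", has_equipment=True, has_customer_data=False)
COLD = SiteType("cold", has_equipment=False, has_customer_data=False)


def steps_before_operational(site: SiteType) -> list:
    """List what still has to happen before the site can take over."""
    steps = []
    if not site.has_equipment:
        steps.append("install equipment")
    if not site.has_customer_data:
        steps.append("load customer data")
    return steps


for site in (HOT, WARM, COLD):
    print(site.name, steps_before_operational(site))
# hot []
# warm ['load customer data']
# cold ['install equipment', 'load customer data']
```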
Distance is a key element of a disaster recovery site. A closer site is easier to manage, but it should be far enough away that it's not impacted by a major disaster affecting the primary data center. Sites farther away might require more staff and drive up costs.
A cloud recovery site is another option. Cloud storage is often cheaper and requires fewer resources and infrastructure, but admins must be mindful of bandwidth and security.
Regarding sites, an organization should consider site proximity, internal and external resources, operational risks, service-level agreements and cost when contracting with disaster recovery service providers.
Tiers of DR
In the 1980s, the Share Technical Steering Committee, working with IBM, presented a description of disaster recovery service levels using tiers 0 through 6. Tier 0 represents the least amount of off-site recoverability and tier 6 represents the most.
- Tier 0: No off-site data. Recovery is only possible using on-site systems.
- Tier 1: Physical backup with a cold site. Data, likely on tape, is transported to an off-site facility that doesn't have the necessary hardware installed.
- Tier 2: Physical backup with a hot site. Data, likely on tape, is transported to an off-site facility that has the necessary hardware installed to support key systems of the primary site.
- Tier 3: Electronic vaulting. Data is electronically transmitted to a hot site.
- Tier 4: Point-in-time copies/active secondary site. Vital data is copied across the primary and secondary sites, each site backing up the other. Disk is often used in this tier.
- Tier 5: Two-site commit/transaction integrity. Data is continuously transmitted across sites.
- Tier 6: Minimal to zero data loss. Recovery is instantaneous, often involving disk mirroring or replication.
A tier 7 was later added to include automation, and it represents the highest level of availability in disaster recovery scenarios.
In general, the ability to recover improves with each higher tier, but costs also increase.
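Since cost rises with each tier, one reasonable approach is to pick the lowest tier that satisfies the requirements. The Python sketch below encodes the tiers listed above as an enumeration and applies that idea to tiers 0 through 3 only. The `DRTier` member names and the `lowest_tier` helper are hypothetical, and the three yes/no requirements are a deliberate simplification.

```python
from enum import IntEnum


class DRTier(IntEnum):
    """Disaster recovery tiers as described above; higher means more recoverable."""
    NO_OFFSITE_DATA = 0
    PHYSICAL_BACKUP_COLD_SITE = 1
    PHYSICAL_BACKUP_HOT_SITE = 2
    ELECTRONIC_VAULTING = 3
    POINT_IN_TIME_COPIES = 4
    TWO_SITE_COMMIT = 5
    MINIMAL_TO_ZERO_DATA_LOSS = 6
    AUTOMATION = 7


def lowest_tier(needs_offsite: bool, needs_hot_site: bool, needs_electronic_transfer: bool) -> DRTier:
    """Return the lowest of tiers 0-3 that covers the stated requirements."""
    if needs_electronic_transfer:
        return DRTier.ELECTRONIC_VAULTING
    if needs_hot_site:
        return DRTier.PHYSICAL_BACKUP_HOT_SITE
    if needs_offsite:
        return DRTier.PHYSICAL_BACKUP_COLD_SITE
    return DRTier.NO_OFFSITE_DATA


choice = lowest_tier(needs_offsite=True, needs_hot_site=True, needs_electronic_transfer=False)
print(choice.name, int(choice))  # PHYSICAL_BACKUP_HOT_SITE 2
```

Using `IntEnum` keeps the tiers comparable, so a check such as `DRTier.AUTOMATION > DRTier.TWO_SITE_COMMIT` reads naturally.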
Types of disasters
There is a wide range of disasters -- caused by both humans and nature -- that lead to recovery situations. A certain type of disaster might seem improbable, but it's important to recognize the possibility of it occurring for disaster recovery purposes.
Examples of types of disasters include:
- Application or virtual machine failure.
- Communication failure.
- Chassis failure, which can cause a single host or multiple hosts to fail.
- Rack failure.
- Data center disaster ranging from inadvertent triggering of a sprinkler system to a power outage to a flood or fire.
- Building disaster.
- Campus disaster; for example, a tornado that destroys one area.
- Citywide disaster.
- Regional disaster. Examples include Hurricane Katrina and Superstorm Sandy.
- National disaster. This is more likely in very small countries, but not impossible in larger nations.
Recognizing that these disasters exist is the first step in planning for them. There are two reports that can help prepare an organization for potential disasters: a risk assessment and a business impact analysis (BIA).
A risk assessment is conducted to identify hazards that could negatively affect an organization and provide ways to reduce the damage. Risks vary by factors such as the industry the organization is in and its geographic location, so it's critical that the disaster recovery planning process includes conducting a risk assessment. Generally, a risk assessment should be performed by identifying potential hazards, determining who or what these hazards would harm and using the findings to update procedures to take these risks into account.
A BIA determines and evaluates the effects of a disaster on business operations. A business impact analysis can help predict costs of a disaster striking, both financial and non-financial. A BIA looks at the effect of different disasters on an organization's safety, finances, marketing, business reputation, legal compliance and quality assurance. Conducted prior to a risk assessment, a BIA identifies crucial parts of the organization and can help form RPOs and RTOs in the DR plan.
Disaster recovery vendors
Disaster recovery vendors can take many forms, because DR is more than just an IT issue. DR vendors include not only recovery software vendors and DRaaS providers, but also organizations that deal with incident response and emergency planning.
Major vendors in the DR software and DRaaS market include Dell EMC, IBM, Veeam, Acronis, Zerto, Commvault, Arcserve and Azure Site Recovery, among many others. Emergency communication vendors are also a key part of the recovery process, and include Everbridge Crisis Management, Cisco, Rave Alert, AlertMedia and BlackBerry AtHoc.
Although some organizations might not be willing to invest in DR, that attitude is changing. Thanks to a growing disaster recovery and incident response market, organizations of all sizes should be able to work disaster recovery into their budgets.