
Cloud data recovery: Documentation is the key to success

In creating a plan for data recovery in the cloud, it's important to keep track of your infrastructure, DR requirements and the potential duration of failover.

The public cloud provides a great opportunity for IT departments to implement a business continuity/disaster recovery plan without having to go to the expense of building out a dedicated data center. With a cloud data recovery system in place, the cloud can be used as a basic data repository or even as a location to run applications when primary systems go down.

When building a DR plan, the first step is to look at the applications used to deliver IT services and determine what needs to be protected if a disaster occurs. This means creating an inventory of applications and the services they need to run. Many organizations have moved to virtualization as their core server deployment model; however, physical servers still need to be considered. A comprehensive cloud data recovery plan should document the following:

  • Physical and virtual servers used to deliver infrastructure. These can include Active Directory (AD) servers, DNS/DHCP servers and appliances.
  • Physical servers used to deliver applications. There may be good reasons why services are still delivered on physical servers; this may include scaling and performance requirements or the use of custom hardware and operating systems. However, cloud recovery services may provide the opportunity to virtualize some of these elements.
  • Virtual servers used to deliver applications. There may be dozens or hundreds of virtual machines used to implement applications. Each needs to be identified and inventoried, looking at storage, memory and virtual processor requirements.

It's a good idea to identify infrastructure servers ahead of time, as these systems may need to be brought up first if a disaster strikes. It's possible to preconfigure AD, DNS and DHCP services to be running in the cloud and synchronized with their on-site equivalents, making the DR process easier and quicker to implement.
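As a rough illustration, the inventory described above could be kept as structured records rather than free text, which makes it easy to total resource requirements and to list infrastructure servers ahead of application servers. This is a minimal sketch only; the `Server` fields, the server names and the sizing figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    kind: str        # "physical" or "virtual"
    role: str        # "infrastructure" or "application"
    vcpus: int
    memory_gb: int
    storage_gb: int


def recovery_order(servers):
    """Return infrastructure servers first; sorted() is stable, so the
    original order is preserved within each group."""
    return sorted(servers, key=lambda s: s.role != "infrastructure")


# Hypothetical inventory entries, for illustration only.
inventory = [
    Server("crm-app-01", "virtual", "application", 4, 16, 200),
    Server("ad-01", "virtual", "infrastructure", 2, 8, 80),
    Server("erp-db-01", "physical", "application", 16, 128, 2000),
    Server("dns-dhcp-01", "physical", "infrastructure", 2, 4, 40),
]

ordered = recovery_order(inventory)
total_memory_gb = sum(s.memory_gb for s in inventory)

for server in ordered:
    print(f"{server.role:15} {server.kind:9} {server.name}")
print(f"Total memory to provision: {total_memory_gb} GB")
```

The same records can later carry the recovery targets agreed with application owners, so the inventory and the service levels live in one place.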

Understanding the network configuration is critical to getting DR in the cloud to work successfully. This means spending time understanding the interdependencies of applications at the network layer, including any security and firewall configurations. Good cloud data recovery questions to ask include:

  • Are any of my applications or servers latency-dependent on each other?
  • Are there any East-West firewall rules in place to manage on-site traffic?
  • What are my external bandwidth requirements for customer-facing applications?

More in part 2

This is the first part of a two-part series. In the second part, check out the considerations for picking a cloud recovery provider and implementing the service.

Determine cloud data recovery requirements

It's impractical to assume that every application needs to be immediately recovered in the event of a disaster. Instead, applications should be prioritized against a set of criteria to determine how quickly systems and data need to be brought back into operation, and how many of them need to be recovered at the same time. There are a number of standard metrics that should be used when determining the service levels for recovering applications:

  • Recovery time objective. This is a measure of how much outage time can be tolerated before the application is back up and running; it is typically measured in minutes or hours. An RTO of zero, for example, represents no tolerance for outage, whereas an RTO of one hour means the application must be recovered within one hour from the point of the DR incident.
  • Recovery point objective. This is a measure of how much data loss can be tolerated once the application is running again. An RPO of zero indicates that all data must be recovered to the point at which the disaster occurred, whereas an RPO of 24 hours means the data or system recovered may be up to 24 hours out of date.
  • Service-level objective. SLOs measure the aims for overall application recovery. For instance, an agreement might be in place to recover 90% of applications within four hours. A tighter SLO requires more infrastructure and potentially more people to achieve, so having some flexibility can help to manage DR costs.

SLOs allow the value of data and applications to be prioritized. For example, an online credit card processing system will have an RPO of zero and a very low RTO. It's reasonable to expect that these kinds of systems never lose data. At the other end of the scale, reporting applications may be able to tolerate being 24 to 48 hours out of date as their data is extracted from other applications. Other systems will probably sit somewhere in between these two extremes.
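To make the three metrics concrete, the sketch below checks a copy schedule against an RPO and works out what fraction of applications met a recovery window in a DR test. It is an illustration under stated assumptions: the application names, the copy intervals and the drill timings are invented, and the worst-case data loss is assumed to equal the gap between copies.

```python
from dataclasses import dataclass


@dataclass
class AppTarget:
    name: str
    rto_hours: float   # tolerated outage time
    rpo_hours: float   # tolerated data loss, expressed as age of data


def meets_rpo(target, hours_between_copies):
    """Assumes worst-case data loss equals the gap between copies."""
    return hours_between_copies <= target.rpo_hours


def meets_rto(target, measured_recovery_hours):
    """True if the measured recovery time is within the RTO."""
    return measured_recovery_hours <= target.rto_hours


def slo_attainment(recovery_hours, window_hours):
    """Fraction of applications recovered within the SLO window."""
    within = sum(1 for h in recovery_hours.values() if h <= window_hours)
    return within / len(recovery_hours)


# Hypothetical targets echoing the two extremes described above.
cards = AppTarget("card-processing", rto_hours=0.25, rpo_hours=0)
reporting = AppTarget("reporting", rto_hours=48, rpo_hours=24)

# Hypothetical recovery times (hours) recorded during a DR test.
drill = {"card-processing": 0.5, "crm": 3.0, "intranet": 2.0, "reporting": 6.0}

print(meets_rpo(reporting, hours_between_copies=24))   # nightly copy
print(meets_rpo(cards, hours_between_copies=0.25))     # 15-minute copies
print(meets_rto(cards, drill["card-processing"]))
print(slo_attainment(drill, window_hours=4))
```

With these invented figures, three of the four applications are recovered inside the four-hour window, so the test would fall short of an SLO of 90% within four hours.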

Setting the right cloud data recovery requirements involves having a conversation with the application owners in the business, as they will know how important their applications are. From experience, business owners tend to assume all their applications are important -- until they are given an idea of the cost of recovery. So it's good to attach an estimated recovery cost to each application before agreeing on its service levels.

One final thought on service levels: Some stringent requirements, like RPOs of zero, can't be delivered with cloud-based DR due to latency between on-site and cloud locations. These applications may need to be excluded from cloud-based DR and provided with a more custom DR product.

How long will the DR service run?

One final area to document is how long services may run in the public cloud. Making this decision will depend on the type of incident that occurs. Not all disasters result in the total loss of on-site capability. A sliding scale of incident types will exist, such as:

  • Server loss. A single physical server or virtual server host fails. Losing a virtual server host may mean some, but not all, applications need to move to run in DR mode.
  • Multisystem loss. Multiple applications may be lost if, for example, a shared storage array suffers an outage.
  • Data center loss. In the worst-case scenario, the entire data center is lost or inaccessible. All services have to be run in DR mode.

In some scenarios, services may move for a few hours or days. In the event of a total site loss, the requirement may be to run DR services for weeks or months until the original facilities are rebuilt. Cloud recovery services will charge for the time that live services are used, so this is an important consideration when choosing a DR service.
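Because charges accrue for the time live services run, a simple estimate per incident type can show how much the expected duration matters. The arithmetic below is a sketch only: the flat hourly rate, the instance counts and the durations are hypothetical, and real provider pricing will also include items such as storage and data transfer.

```python
HOURS_PER_DAY = 24


def failover_cost(instances, hourly_rate, days):
    """Estimated compute charge for running DR instances for a number of days."""
    return instances * hourly_rate * days * HOURS_PER_DAY


# Hypothetical flat rate per running instance, in dollars per hour.
RATE = 0.50

# Hypothetical (instances, days in DR mode) for each incident type.
scenarios = {
    "server loss": (5, 2),
    "multisystem loss": (40, 7),
    "data center loss": (200, 90),
}

costs = {
    name: failover_cost(instances, RATE, days)
    for name, (instances, days) in scenarios.items()
}

for name, cost in costs.items():
    print(f"{name:18} ${cost:,.0f}")
```

Even with made-up numbers, the gap between a two-day server outage and a three-month site rebuild spans several orders of magnitude, which is why the potential duration of failover belongs in the plan.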
