In the modern IT world, data protection is an essential requirement for delivering business continuity during an IT disaster. Data is the lifeblood of all enterprises and a valuable asset, so efficient processes must be in place to ensure the business can access critical systems in a timely fashion. The cost of downtime can be thousands of dollars per hour depending on the type and size of the organization.
Disaster recovery (DR) was once seen as an “all-or-nothing” scenario: the button was pressed because the company had experienced a major disaster in IT services that were deployed on a monolithic infrastructure such as the mainframe. The traditional DR model was based on tape backup, with secondary backup tapes stored offsite. This model can incur significant downtime, as tapes must be retrieved before data and applications can be restored. Organizations that required faster restore times replicated data to their own secondary facilities, or used shared services offered by DR specialists that provided on-demand recovery capabilities. However, these models were very expensive.
The Internet, virtualization and the evolution of public clouds have provided a much more practical opportunity for businesses of all sizes to implement a business continuity and disaster recovery (BC/DR) plan without heavily investing in additional data center space. Operations can be moved to the cloud “on demand” as required, via a cloud-based disaster recovery service, either in a controlled fashion or as part of an unplanned emergency. As such, it is more appropriate to talk about business continuity as the process of ensuring IT services are continuously available, with disaster recovery being the process of migrating services to a secondary location.
Cloud DR strengths
Today, applications are likely to be much more widely distributed, running in virtualized environments or (in the future) on containers. This has changed the backup paradigm and shops have more flexibility to recover all or part of IT services where required. With a cloud-based disaster recovery service, businesses can:
- Provide continuity for their operational services, regardless of where they are delivered from
- Perform tactical failover to secondary services in the event of a hardware or software failure in some (or all) of their IT systems
- Perform controlled failover of workloads to enable maintenance of other components such as the network or the environmental infrastructure
- Migrate workloads to cope with unplanned demand or growth
- Test DR capabilities on demand with no impact to the primary systems
Of course, the primary focus of BC/DR is to meet the service-level agreements and objectives provided to the business. This means meeting recovery time objective (RTO) and recovery point objective (RPO) metrics on an application-by-application basis. Depending on an organization's specific RTO/RPO requirements, there are three main cloud DR models:
Data only. The DR process focuses on ensuring a backup copy of data is available on the cloud platform and represents the lowest level of recovery. This means protecting data such as that sitting on file servers, including home directories and shared folders. In the event of a disaster, the data can be accessed from the DR location in the cloud. Depending on the amount of data that must be restored, downtime can be significant and even require physically shipping data back to the primary site on an appliance to restore.
Application-based. The DR process focuses on replicating application data into the cloud to a secondary deployment of the application. Data is moved using native application capabilities or a third-party product. Failover consists of repointing access to the application running in the cloud (typically through DNS changes). The secondary application instance is running permanently in the cloud, receiving data on a periodic basis.
Virtual machine image. The DR process replicates an entire VM image, including data, to the cloud. The VM image itself is dormant (not running) until required, at which point it can be powered up and accessed, typically through DNS changes. VM image backup can also be used as a method of protecting physical (bare metal) application deployments through P2V replication.
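Whichever model is chosen, the RTO and RPO targets mentioned above can be checked per application after a test or a real failover. The following is a minimal sketch of that comparison; the `Objectives` class, the function name and all of the timings are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    """Hypothetical per-application targets agreed with the business."""
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable time to restore service


def meets_objectives(last_replicated: datetime, failure: datetime,
                     restored: datetime, target: Objectives) -> dict:
    """Compare the achieved recovery point and recovery time with the targets."""
    achieved_rpo = failure - last_replicated  # data written in this window is lost
    achieved_rto = restored - failure         # service was unavailable for this long
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= target.rpo,
        "rto_met": achieved_rto <= target.rto,
    }


# Illustrative numbers only: a file server with a 4-hour RPO and an 8-hour RTO.
failure = datetime(2024, 1, 10, 12, 0)
result = meets_objectives(
    last_replicated=failure - timedelta(hours=1),
    failure=failure,
    restored=failure + timedelta(hours=10),
    target=Objectives(rpo=timedelta(hours=4), rto=timedelta(hours=8)),
)
print(result["rpo_met"], result["rto_met"])  # True False
```

In this made-up case the last copy was only an hour old, so the RPO is met, but a 10-hour restore misses the 8-hour RTO. That is the kind of gap that might push a data-only deployment toward one of the other two models.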
Cloud DR issues
Of course, moving to a cloud-based disaster recovery service has issues. Many of the following examples could be experienced when deploying any DR system; however, some are more particular to cloud deployments.
Network bandwidth. Bandwidth is an issue from a number of angles. First, you must have enough throughput between the primary site and the cloud to ensure data can be replicated in a timely fashion without excessive replication lag (which affects RPO). Second, you need enough bandwidth available to recover changed data back to the primary site once the DR issue is over. Third, you must be able to access services from the cloud, either from the internal business network or from the Internet with client-facing applications.
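A rough way to size the link is to work out how long a given volume of data takes to move, and what link rate is needed to replicate a day's changes within a set window. The sketch below does this arithmetic; the function names, the 80% efficiency factor and the example volumes are assumptions for illustration, not measurements.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to move data_gb over a link rated at link_mbps megabits per second.

    efficiency is an assumed fraction of the nominal link rate that is usable
    after protocol overhead and competing traffic.
    """
    usable_mbps = link_mbps * efficiency
    megabits = data_gb * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    return megabits / usable_mbps / 3600


def min_link_mbps(daily_change_gb: float, window_hours: float,
                  efficiency: float = 0.8) -> float:
    """Smallest nominal link rate that replicates one day's changes within window_hours."""
    megabits = daily_change_gb * 8 * 1000
    return megabits / (window_hours * 3600) / efficiency


# Illustrative figures: failing back 500 GB of changed data over a 100 Mbps link.
print(round(transfer_hours(500, 100), 1))  # 13.9

# Illustrative figures: 50 GB of daily change replicated continuously over 24 hours.
print(round(min_link_mbps(50, 24), 2))     # 5.79
```

The same calculation applies in both directions, so it is worth running it for the failback volume as well as for day-to-day replication.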
Network security. Data moving to the cloud will be outside of the protection of the private network in the data center, so it must be encrypted in flight at a minimum. Compliance or other regulatory restrictions may require data to be encrypted at rest when offsite. This can have implications on how applications are implemented onsite, to ensure that the encryption process does not interfere with normal operations.
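For a home-grown replication or backup script, encryption in flight typically means wrapping the connection in TLS with certificate verification switched on. The following is a minimal sketch using Python's standard `ssl` module; the function name is hypothetical, and commercial DR products handle this internally in their own way.

```python
import ssl


def make_replication_tls_context() -> ssl.SSLContext:
    """Client-side TLS settings for a replication or backup connection."""
    context = ssl.create_default_context()  # verifies certificates and host names
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocol versions
    return context


context = make_replication_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)  # True True
```

The resulting context can be passed to `context.wrap_socket(sock, server_hostname=...)` or to any client library that accepts an `SSLContext`. Encryption at rest is a separate concern and is not covered by this sketch.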
Network addressing. As application workloads are moved to the cloud, IP addresses will change. When primary and secondary application servers are kept onsite, IP addressing can be managed relatively easily, either by stretching a Layer 2 network between sites or by using routing. Moving an application to the cloud will require changes to DNS (to point to the new server/data location) and, in some cases, modification of the application itself.
Network latency. Running applications from the cloud rather than onsite may cause performance problems due to increased latency. This can occur if only part of a service is migrated into DR, with issues experienced in communication between onsite and offsite services.
Licensing. DR instances of applications may require purchasing additional licenses, depending on the terms of the application vendor. These license options may be different for cloud implementations or, in the worst case, cloud deployment may not be supported at all.
Cost. The cost of implementing DR will include providing the cloud services, additional network capacity, licensing dedicated backup software and extra application licenses. All of these may vary depending on the way in which cloud DR is delivered.
Choices: DIY or buy?
Should you build the DR capability yourself or buy a cloud-based disaster recovery service or product? Data-only DR can be implemented relatively simply, by copying data to a cloud-based file service, or using products such as Acronis Cloud Backup or Zetta's DataProtect.
Application-based DR can be achieved by creating a target virtual machine with a cloud service provider and implementing replication at the application level. Of course, the IT organization will be responsible for ensuring the DR VM image is suitably maintained (patched, upgraded) to keep in step with the production deployment. In addition, if failover is invoked, the application teams will need to be involved in moving data back afterward. Certain database replication products do not support incremental replication of data back to the primary database instance (even if they technically work), which may present a problem.
VM replication provides the ability to move an entire application to the cloud as a virtual machine. This is a good idea when there are complex application/server dependencies, as with Microsoft SharePoint, because there is no need to build and maintain a separate VM image. Products are available from vendors such as Zerto that integrate into the hypervisor and replicate all I/O to the cloud instance. In a recovery situation, the cloud-based image is used to run the production service, following any configuration amendments, such as setting IP addresses and matching to the DR hardware specification.
With most block-based replication products, the cloud image can appear to be a crash-consistent copy of the application that can subsequently require extended recovery on startup. This is where the ability to test the recovery image becomes critical. Testing means bringing up the application in an isolated manner in disaster recovery mode that allows integrity checks to be performed without impacting the production application.
Comparing cloud DR services
What should buyers look for when reviewing a cloud-based disaster recovery service? Here are a few additional pointers:
- Cost basis. How is the service charged: per TB of storage or per VM? Are there additional charges for running in DR mode?
- Time limitations. How long can I run the service in DR mode? Are there any restrictions on how many systems I can fail over?
- How does failback work? Can I incrementally fail back to production (take back only the changes) or do I need to restore all my data?
- Does the service offer extended protection? If I am in DR mode, can I also replicate my data to a third copy until I return to production?
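The cost-basis question above can be made concrete with a small comparison of per-TB and per-VM charging, plus any surcharge for time spent in DR mode. The following is a sketch only: every price, volume and function name in it is hypothetical, and real services may charge on entirely different terms.

```python
def monthly_cost_per_tb(protected_tb: float, price_per_tb: float) -> float:
    """Monthly charge when the service bills on protected storage."""
    return protected_tb * price_per_tb


def monthly_cost_per_vm(vm_count: int, price_per_vm: float) -> float:
    """Monthly charge when the service bills on the number of protected VMs."""
    return vm_count * price_per_vm


def dr_mode_surcharge(vm_count: int, days_in_dr: int, price_per_vm_day: float) -> float:
    """Extra charge for running failed-over VMs in the provider's cloud."""
    return vm_count * days_in_dr * price_per_vm_day


# Hypothetical estate: 40 VMs holding 20 TB, with made-up prices.
per_tb = monthly_cost_per_tb(protected_tb=20, price_per_tb=30.0)
per_vm = monthly_cost_per_vm(vm_count=40, price_per_vm=25.0)
surcharge = dr_mode_surcharge(vm_count=40, days_in_dr=5, price_per_vm_day=4.0)

print(per_tb, per_vm, surcharge)  # 600.0 1000.0 800.0
```

With these invented numbers, per-TB charging is cheaper for steady-state protection, but five days in DR mode would cost more than a month of either plan, which is why the DR-mode charges are worth asking about up front.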
A cloud-based disaster recovery service offers flexibility for providing data protection to on-premises environments. As applications evolve, we will perhaps see the distinction between DR and dispersed applications start to blur, with DR providing both application protection and scalability. Whatever happens, the need to provide application resiliency and data protection will always remain.