Does your disaster recovery plan include contingencies for service provider outages? We know on an intellectual level that every computer system experiences outages. But we sometimes need to experience an outage before we understand the issues at a more visceral level and plan properly.
Were you able to enact your disaster recovery (DR) plan during the Amazon Simple Storage Service (S3) outage in February 2017? Even if your DR plan uses a different cloud service, there are still lessons to be learned from the Amazon Web Services outage. In particular, you need to be aware of the service-level agreement for each element of your DR plan, especially those outside of your control.
What went wrong
The Amazon Web Services outage was a fairly simple one. An AWS engineer doing routine maintenance typed a command incorrectly. As a result, the AWS infrastructure that manages and monitors S3 did not operate as it should have. It appears that any application using S3 in the US-East-1 region was unable to read existing objects or create new ones.
For DR applications, the outage meant that new backups could not be stored, possibly breaching customer recovery point objectives (RPOs). DR applications also could not do any restores from existing backups, impacting the ability to achieve recovery time objectives (RTOs).
It took approximately six hours for AWS to fully restore service. According to AWS, S3 targets 99.9% availability in a given month, which allows for a little less than 44 minutes of downtime monthly. Clearly, AWS should pay back some of its service fees, as a six-hour outage in a 28-day month works out to roughly 99.1% availability, well short of the target. But that would be little comfort if you had a DR event during the Amazon Web Services outage. You would have had to wait until the outage was over before you could start your recovery using the last completed backup.
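The availability arithmetic is easy to check for yourself. The following is a minimal Python sketch, not anything published by AWS: the function names are made up for illustration, and the six-hour outage length is this article's estimate. Note that the "a little less than 44 minutes" figure assumes an average-length month; February's 28 days leave an even smaller budget.

```python
AVERAGE_DAYS_PER_MONTH = 365.25 / 12  # about 30.44 days


def downtime_budget_minutes(target: float, days_in_month: float) -> float:
    """Minutes of downtime a monthly availability target permits."""
    return (1.0 - target) * days_in_month * 24 * 60


def achieved_availability(outage_hours: float, days_in_month: float) -> float:
    """Fraction of the month the service was up, given total outage hours."""
    return 1.0 - outage_hours / (days_in_month * 24)


# Budget at a 99.9% target for an average month and for February 2017 (28 days).
budget_average = downtime_budget_minutes(0.999, AVERAGE_DAYS_PER_MONTH)
budget_february = downtime_budget_minutes(0.999, 28)

# Assumption: a single six-hour outage and no other downtime that month.
availability = achieved_availability(6, 28)

print(f"Budget, average month: {budget_average:.1f} minutes")
print(f"Budget, February:      {budget_february:.1f} minutes")
print(f"Achieved availability: {availability:.2%}")
```

Plugging in your own provider's target and your own outage history gives a quick sense of how far a single bad day can push a month outside its service level.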
How to make it right
The first lesson following the Amazon Web Services outage is that cloud services are beyond your control. Being aware of the available service level will enable you to make the business decision of whether a particular cloud service fulfills your DR needs.
The probability of a cloud provider failure occurring at the same time that your primary data center fails is low. A quick Google search suggests there have been around three significant S3 outages since the service launched in 2006. It seems to me that the network link between your data center and AWS is more of a risk to your RPO/RTO. Are these risks documented in your DR plan? Does using disaster recovery as a service (DRaaS) still make business sense?
There are options, such as using more sites, if the outage has made management skittish about DR in the cloud. For example, winter storms in the US-East-1 (northern Virginia) region wouldn't impact the EU-West-1 (Ireland) region. By replicating the S3 bucket with your backups from US-East-1 to EU-West-1, or having your backup application send backups directly to both regions, you should be protected against the failure of a single AWS region.
You might even choose to implement an S3-compatible storage system at a remote office, and also have your backup software write to that location.
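To make the "write to more than one location" idea concrete, here is a rough Python sketch. Everything in it is hypothetical: `write_backup`, the uploader callables, and the in-memory dictionary stand in for whatever your backup software and storage clients actually provide. The point is only the behavior: a backup counts as stored if at least one target accepts it, and a failed target is reported rather than aborting the run.

```python
from typing import Callable, Dict, List

# Hypothetical interface: an uploader stores (key, data) and raises on failure.
Uploader = Callable[[str, bytes], None]


def write_backup(key: str, data: bytes, targets: Dict[str, Uploader]) -> List[str]:
    """Send one backup to every target; return the names that accepted it."""
    succeeded = []
    for name, upload in targets.items():
        try:
            upload(key, data)
            succeeded.append(name)
        except Exception as exc:
            # Report the failed target but keep trying the remaining ones.
            print(f"backup {key!r} failed for {name}: {exc}")
    if not succeeded:
        raise RuntimeError(f"backup {key!r} was not stored anywhere")
    return succeeded


# Stand-ins for real storage clients, used only to demonstrate the behavior.
ireland_store: Dict[str, bytes] = {}


def upload_to_ireland(key: str, data: bytes) -> None:
    ireland_store[key] = data


def upload_to_virginia(key: str, data: bytes) -> None:
    raise ConnectionError("region unavailable")  # simulates the outage


stored_in = write_backup(
    "db-2017-02-28.bak",
    b"backup bytes",
    {"us-east-1": upload_to_virginia, "eu-west-1": upload_to_ireland},
)
print(stored_in)
```

In a real deployment each uploader would wrap the upload call of a storage client pointed at a different region, provider or on-premises S3-compatible system, and the list of successful targets would feed your RPO reporting.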
If you are truly mistrustful, you can send backups to two different cloud providers with completely separate infrastructures. The downside is that sending backups to two locations means paying for roughly twice the storage and network transfers. Then there is the need to manage multiple DR plans, one for each location where you replicate data. A little math will probably show that the extra cost isn't worth the extra availability.
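That "little math" might look something like the sketch below. It assumes each backup target meets a 99.9% availability figure and, importantly, that the targets fail independently of each other, which is only plausible when they share no infrastructure. The function name and the numbers are illustrative, not measurements.

```python
from typing import Iterable

HOURS_PER_YEAR = 365 * 24


def combined_availability(availabilities: Iterable[float]) -> float:
    """Chance that at least one target is up, assuming independent failures."""
    p_all_down = 1.0
    for availability in availabilities:
        p_all_down *= 1.0 - availability
    return 1.0 - p_all_down


single = combined_availability([0.999])
dual = combined_availability([0.999, 0.999])

# Expected hours per year with no reachable backup target.
hours_single = (1.0 - single) * HOURS_PER_YEAR
hours_dual = (1.0 - dual) * HOURS_PER_YEAR

print(f"One target:  {single:.4%} available, about {hours_single:.2f} hours down per year")
print(f"Two targets: {dual:.4%} available, about {hours_dual * 3600:.0f} seconds down per year")
```

Under these assumptions, a second target trades roughly double the storage and transfer bill for a few hours of extra availability per year. Whether that is worth paying depends on what an hour without DR would cost your business.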
Any computer system can, and will, have downtime. A cloud-based DRaaS is no exception. Does your business understand the possible effect on continuity if a cloud failure, like the Amazon Web Services outage, disrupts your DR?
While the majority of businesses will not want to increase their spending to make their DR more available, a tiny number of businesses will need the best possible DR posture despite the costs.