Google said it well: "Resilience: designed to withstand the unexpected." This definition implies that no failure of a single instance of an application can close down operations, which means apps should be capable of running in a multi-instance manner. This requirement can stymie legacy apps, especially those written in COBOL, but whole classes of apps are already running fine this way, including web servers, media delivery servers and databases.
Generally, apps that can run multiple instances are easy to port to the cloud. This delivers a first level of resiliency, but it can't guarantee nonstop operation or the ability to ride through network operations center outages and natural disasters. Evading those problems takes cloud resiliency to the next level, where geodiversity enters the equation.
The need to run parts of an app instance set in different physical locations isn't a trivial problem. Let's step back a little and see what we are trying to achieve.
How outage differences affect cloud resiliency
When instances in a single location are operating, the loss of an instance triggers the creation of a replacement, which can be on stream in a minute or so. Data is all local, so recovery means building the instance, connecting virtual LANs (VLANs) and pointing the instance at the right data. It may even be possible to recover a transaction in process and take it to completion.
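The local recovery sequence described above can be sketched as a small planning function. This is an illustration only: the function name, the instance names, the VLAN label and the data volume label are all hypothetical, and the function merely lists the steps rather than calling any real cloud API.

```python
from typing import Dict, List


def plan_local_recovery(instances: Dict[str, bool],
                        vlan: str,
                        data_volume: str) -> List[str]:
    """List the recovery steps for every unhealthy instance.

    `instances` maps a (hypothetical) instance name to a health flag.
    Healthy instances are left alone; each failed one gets the three
    steps from the text: build, connect the VLAN, point at the data.
    """
    actions: List[str] = []
    for name, healthy in sorted(instances.items()):
        if healthy:
            continue
        actions.append(f"build replacement for {name}")
        actions.append(f"attach replacement for {name} to {vlan}")
        actions.append(f"point replacement for {name} at {data_volume}")
    return actions


if __name__ == "__main__":
    # One of two web instances has failed; only it is rebuilt.
    for step in plan_local_recovery({"web-1": True, "web-2": False},
                                    vlan="vlan-10",
                                    data_volume="vol-orders"):
        print(step)
```

Because the data stays local, the plan contains no data transfer step, which is a large part of why single-location recovery can finish in about a minute.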
When instances are distributed geographically, the failure mechanism presents differently. Perhaps a quarter, or even more, of the working instances go down. That's complex enough to manage, but the next question is, where is the data needed for the new instances about to be created in distant data centers? There are also networking issues such as zone-aware domain name systems (DNS) and the repointing of load balancers and virtual switches.
How the outage is dealt with also depends on the expected duration of the problem. A router failure might be fixed by a reboot in a minute or two, while recovering from a lightning strike on a major power subsystem could take hours or days. A well-known AWS outage several years ago was caused by bad router code that took a zone offline. As recovery attempts began in other zones, the resulting storm of instance creation, followed by a tsunami of data transfers to feed the new instances, led to outages lasting more than a day. If they had only waited for that router reboot!
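One way to capture the "wait or rebuild" judgment is a simple comparison between the expected outage duration and the time a rebuild elsewhere would take. The sketch below is an assumption-laden illustration, not a description of how any provider decides: the class, the function, the safety factor and all the minute values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Hypothetical summary of what operators know about an outage."""
    cause: str
    expected_minutes: float  # best guess at time until the zone returns


def should_fail_over(estimate: OutageEstimate,
                     rebuild_minutes: float,
                     safety_factor: float = 2.0) -> bool:
    """Return True when rebuilding elsewhere is likely to beat waiting.

    `rebuild_minutes` is the assumed time to create instances and move
    data into another zone. The safety factor biases the decision
    toward waiting, because a mass rebuild is disruptive in itself.
    """
    return estimate.expected_minutes > rebuild_minutes * safety_factor


if __name__ == "__main__":
    router = OutageEstimate("router reboot", expected_minutes=2)
    power = OutageEstimate("lightning strike on power subsystem",
                           expected_minutes=600)
    print(should_fail_over(router, rebuild_minutes=30))  # wait it out
    print(should_fail_over(power, rebuild_minutes=30))   # rebuild elsewhere
```

In practice, the hard part is the estimate itself, since the expected duration is rarely known at the moment the outage starts.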
The impact of data management
Ultimately, cloud resiliency depends on sound data management. IT needs a plan for data positioning that is focused on minimizing downtime during an ongoing outage. Ideally, the data for recovery should be prepositioned in the zone where the recovery will occur, but this works only for quasistatic data, not currently active records.
Let's use a four-zone distribution as an example. If one zone fails and looks to be down for a while, there should be a defined alternate zone, so that the DNS tables and related settings can be updated and the instance building scripts can be run. Designating a single alternate zone is critical, since it focuses an administrator's effort on one job.
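The four-zone example might be expressed as a fixed table that gives each zone exactly one predefined alternate. The zone names, the ring-style pairing and the wording of the steps are assumptions made for illustration; the function only produces a checklist and does not touch real DNS or load balancers.

```python
from typing import Dict, List, Set

# Hypothetical pairing: each zone has exactly one predefined alternate.
RECOVERY_ZONE: Dict[str, str] = {
    "zone-a": "zone-b",
    "zone-b": "zone-c",
    "zone-c": "zone-d",
    "zone-d": "zone-a",
}


def recovery_checklist(failed_zone: str, healthy_zones: Set[str]) -> List[str]:
    """Return the ordered steps for recovering `failed_zone`.

    Raises RuntimeError if the predefined alternate is itself down,
    since the plan assumes a single, known target.
    """
    target = RECOVERY_ZONE[failed_zone]
    if target not in healthy_zones:
        raise RuntimeError(f"alternate {target} for {failed_zone} is unavailable")
    return [
        f"update DNS records from {failed_zone} to {target}",
        f"repoint load balancers from {failed_zone} to {target}",
        f"run instance build scripts in {target}",
    ]


if __name__ == "__main__":
    for step in recovery_checklist("zone-c", {"zone-a", "zone-b", "zone-d"}):
        print(step)
```

Keeping the mapping static means nobody has to choose a target during the outage, which is the point of defining the alternate in advance.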
Most of the data in use is quasistatic, but a snapshotted version of the active data can be uploaded periodically from a zone to its recovery zone, leaving any remaining differences to be covered by a journal file kept in the recovery zone. That way, only a small update is required before the recovered zone instances are working again.
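The snapshot-plus-journal idea can be illustrated with a minimal replay routine. The data model here is an assumption: records are a plain dictionary, each journal entry carries a sequence number, and the snapshot records the last sequence number it includes. A real system would use its database's own replication or log-shipping facilities.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
JournalEntry = Tuple[int, str, str]  # (sequence number, key, new value)


def restore(snapshot: Record,
            snapshot_seq: int,
            journal: List[JournalEntry]) -> Record:
    """Rebuild current state from a snapshot and a journal.

    Entries with a sequence number at or below `snapshot_seq` are
    already reflected in the snapshot and are skipped; later entries
    are applied in sequence order.
    """
    state = dict(snapshot)  # copy so the snapshot itself is untouched
    for seq, key, value in sorted(journal):
        if seq > snapshot_seq:
            state[key] = value
    return state


if __name__ == "__main__":
    snapshot = {"order-1": "paid", "order-2": "pending"}
    journal = [
        (9, "order-1", "pending"),   # older than the snapshot, ignored
        (11, "order-2", "paid"),
        (12, "order-3", "pending"),
    ]
    print(restore(snapshot, snapshot_seq=10, journal=journal))
```

Only the entries written after the snapshot are replayed, so the more often snapshots are uploaded, the smaller the update needed at recovery time.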
This approach to cloud resiliency works with hybrid clouds just as well as in the public cloud space. The hybrid cloud can keep its recovery set in the public cloud and the public instances can be recovered to another zone of that public cloud.
Containers bring an extra benefit to cloud resiliency due to their extremely fast startup. In the end, we'll probably find that VLAN setup, DNS update and other networking issues predominate in the time to recover, but software-defined infrastructure may give us a way to shortcut recovery processes and cut that time down. The objective is to be fast enough to recover transactions in progress without exceeding the attention span of the mobile user, but this is a wish list line item today.