Large IT disasters can and do happen, and the only way to adequately prepare for them is through a comprehensive disaster recovery testing plan.
In August 2016, Delta Airlines suffered a major outage at its Atlanta command center due to a power failure. Lasting for almost six hours, the outage resulted in a global computer failure that impacted airports as far away as Tokyo and London. Gate agents had to write boarding passes by hand, and the airline was forced to communicate with passengers through Twitter.
Over the next several days, the chain reaction of events forced the cancellation of more than 2,100 flights and stranded tens of thousands of passengers. While Atlanta was the only city to suffer a power supply disruption, computers around the world were affected because of a complex web of interdependencies that ultimately relied on the systems in Atlanta remaining online.
As with any outage of this magnitude, the Delta incident raises a number of questions. How did a single power failure ground an entire airline? What could the airline have done differently? What did Delta have for a disaster recovery testing plan? And what should IT pros be doing to protect their own organizations?
It was later discovered that approximately 300 of Delta's servers were not connected to a source of backup power. That gap likely contributed to the incident, but it also points to a much deeper problem: had Delta performed regular DR testing, the backup power connectivity issue would have been discovered much sooner.
Ideally, failover testing and backup power supply testing should have been done at the same time. While it would have been unwise to cut the power to see if critical workloads fail over to an alternate data center, Delta's IT department could have initiated a manual failover and then tested the backup power after all of its critical workloads had been safely moved to a location that would not be impacted by the tests.
Four key DR testing practices
So what can corporate IT implement in its own disaster recovery testing plan to avoid a Delta-like failure?
Accept the idea that the data center is not a static environment. Changes occur on a daily basis, ranging from the installation of a new OS patch to something much more involved. Each change within the data center has the potential to derail DR mechanisms that worked previously. As such, a DR testing plan must be an ongoing process, not an annual event.
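One way to operationalize this is to treat every data center change as something that invalidates previous DR test results. The sketch below illustrates the idea with a hypothetical change log and test date; the data, field names, and threshold logic are all illustrative, not taken from any real change management system.

```python
from datetime import date

# Hypothetical change log and last successful DR test date -- illustrative only.
change_log = [
    {"change": "OS patch on web tier", "date": date(2016, 8, 1)},
    {"change": "new storage array firmware", "date": date(2016, 8, 5)},
]
last_dr_test = date(2016, 7, 15)

# Any change made after the last test means prior DR results may no longer hold.
untested = [c for c in change_log if c["date"] > last_dr_test]
if untested:
    print(f"{len(untested)} change(s) since last DR test -- schedule a new test:")
    for c in untested:
        print(" -", c["change"])
```

In practice this check would pull from a real change management database rather than a hard-coded list, but the principle is the same: DR testing cadence should be driven by change activity, not by the calendar alone.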
Evaluate information systems in an effort to detect single points of failure. Don't make the mistake of examining the infrastructure solely at the component level -- delve into the system level. In Delta's case, something in the Atlanta data center acted as a single point of failure and caused a worldwide outage. While it is tempting to point to the power supply as the single point of failure, the real issue was the system connected to that power supply. After all, computers in Tokyo didn't fail because there was no power; they failed because a system in Atlanta was offline.
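System-level single points of failure can be found mechanically if dependencies are modeled as a graph: any node whose removal disconnects the graph is a single point of failure (an articulation point, found here with a standard depth-first search). The dependency map below is a hypothetical simplification of the Delta scenario, not the airline's actual architecture.

```python
from collections import defaultdict

# Hypothetical service dependency map (undirected, for connectivity analysis).
edges = [
    ("tokyo-gate", "atlanta-core"),
    ("london-gate", "atlanta-core"),
    ("atlanta-core", "atlanta-db"),
    ("atlanta-db", "atlanta-power"),
]

graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def articulation_points(graph):
    """Nodes whose removal disconnects the graph -- single points of failure."""
    disc, low, points = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # A non-root node is an articulation point if a child's
                # subtree cannot reach above it.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # The root is an articulation point if it has multiple DFS children.
        if parent is None and children > 1:
            points.add(u)

    for node in list(graph):
        if node not in disc:
            dfs(node, None)
    return points

print(sorted(articulation_points(graph)))
```

Run against this map, the analysis flags the Atlanta core system and its database -- exactly the kind of system-level dependency that component-level review misses.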
Have a mechanism to automatically fail critical workloads over to an alternate data center. Having failover capabilities isn't enough; the secondary data center must also have sufficient resources available to absorb a failover. While this sounds like a basic principle, it is easy to overlook the secondary data center when scaling workloads in the primary one. When this happens, the secondary data center may have insufficient resources to accommodate a failover. Likewise, it is necessary to put policies in place to prevent resources in the secondary data center from being used for purposes other than business continuity. Otherwise, those resources can be encroached upon by someone's pet IT project.
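The capacity side of this can be checked with simple arithmetic: compare the resources the critical workloads would need after a failover against what the secondary site actually has free. The numbers below are invented for illustration; a real check would pull live figures from a capacity management tool.

```python
# Hypothetical capacity figures (CPU cores, GiB RAM) -- illustrative only.
primary_critical = {"cpu": 400, "ram": 1536}   # what must fail over
secondary_total  = {"cpu": 512, "ram": 2048}
secondary_in_use = {"cpu": 200, "ram": 768}    # e.g. a pet project creeping in

for resource, needed in primary_critical.items():
    spare = secondary_total[resource] - secondary_in_use[resource]
    status = "OK" if spare >= needed else "INSUFFICIENT"
    print(f"{resource}: need {needed}, spare {spare} -> {status}")
```

In this example the secondary site looks generously sized on paper, but once the resources already in use are subtracted, it can no longer absorb a failover -- which is why reserving that capacity by policy matters.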
Periodically evaluate the bandwidth consumed by off-site storage replication. If an organization has created a disaster recovery testing plan and implemented a secondary data center, it will presumably have put in place a storage replication mechanism that mirrors data to the secondary data center. As workloads scale, the amount of bandwidth consumed by the replication process can increase. This can cause bandwidth requirements to eventually exceed the link's capacity.
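The replication check reduces to comparing required throughput against link capacity. A minimal sketch, with invented figures for daily change rate and link speed:

```python
# Hypothetical numbers -- illustrative only.
daily_changed_gb = 900          # data changed per day that must replicate off site
replication_window_hours = 24   # mirror must stay current within a day
link_mbps = 100                 # WAN link capacity, megabits per second

# Average throughput required to keep the mirror current
# (GB -> gigabits -> megabits, spread over the window in seconds).
required_mbps = daily_changed_gb * 8 * 1000 / (replication_window_hours * 3600)
utilization = required_mbps / link_mbps
print(f"required: {required_mbps:.1f} Mbps ({utilization:.0%} of link)")
if utilization > 0.8:
    print("replication is approaching link capacity -- plan an upgrade")
```

Here the link is already above 80% utilized by replication alone, so modest workload growth would push requirements past capacity -- the situation this practice is meant to catch before it happens.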