When you execute a disaster recovery (DR) or business continuity (BC) plan for your network, speed and accuracy are of the utmost importance. Because the goals of disaster recovery and business continuity are time-sensitive and critical, checklists make an ideal tool when those plans get called into service.
The following core set of actions must come into play whenever disaster recovery or business continuity scenarios are invoked:
- Detect outage and disaster effects as quickly as possible
- Notify responsible parties that they must take action
- Isolate affected systems to limit the scope of overall damages or losses
- Repair or replace critical systems, and work toward a resumption of normal operations as circumstances permit and dictate
Background for a networking disaster recovery checklist
When it comes to invoking DR or BC plans, the initial trigger relies on speedy detection and timely notification. That's why it's essential to use "heartbeat" or "keep-alive" technologies like those built into many modern routers, switches and servers as part of their management instrumentation. It also means checking key assets (systems, infrastructure elements, WAN links and so forth) to make sure they are all properly instrumented and that they post notifications within a stipulated time period after any kind of failure occurs.
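As a rough illustration of the "stipulated time period" idea, the sketch below flags any asset that has been silent longer than its allowed window. The `Asset` class, the asset names and the thresholds are all hypothetical; a real deployment would take this data from whatever monitoring system your devices already report to.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    last_heartbeat: float  # time (in seconds) the last keep-alive was seen
    max_silence: float     # stipulated seconds of silence before alerting


def find_silent_assets(assets, now):
    """Return names of assets whose last heartbeat is older than allowed."""
    return [a.name for a in assets if now - a.last_heartbeat > a.max_silence]


# Hypothetical inventory; names and thresholds are illustrative only.
assets = [
    Asset("core-router-1", last_heartbeat=1000.0, max_silence=30.0),
    Asset("wan-link-a", last_heartbeat=940.0, max_silence=30.0),
    Asset("edge-switch-3", last_heartbeat=995.0, max_silence=60.0),
]

# At time 1010, only wan-link-a has been silent longer than its window.
print(find_silent_assets(assets, now=1010.0))
```

Anything the function returns is a candidate for notifying the responsible parties; how that notification is delivered is left out of the sketch.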
Likewise, it's vital to construct a DR or BC plan that covers all networking assets, including switches, routers, firewalls, proxies, caching servers and load balancers, and to establish a backup regime with sufficient storage that remains remotely accessible online in the event that some kind of recovery is required.
Practice drills should simulate network failures at both the WAN and LAN levels to make sure all detection and recovery procedures work as planned, and that recovery time objective (RTO) and recovery point objective (RPO) levels are met. Once a practice drill is over, the key activity in closing the learning loop is to update your disaster recovery or business continuity plan: adjust RTOs and RPOs to conform with what proved workable in practice, and adjust on-the-ground implementations to reflect the experience gained during the drill.
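To make the post-drill comparison concrete, here is a minimal sketch that checks measured drill results against the plan's RTO and RPO targets. The `DrillResult` fields and the sample figures are assumptions made up for illustration, not values from any real drill.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    scenario: str
    recovery_minutes: float   # measured time to restore service
    data_loss_minutes: float  # measured window of data that could not be recovered
    rto_minutes: float        # recovery time objective from the plan
    rpo_minutes: float        # recovery point objective from the plan


def missed_objectives(result):
    """Return the objectives a drill missed: "RTO", "RPO", both, or none."""
    missed = []
    if result.recovery_minutes > result.rto_minutes:
        missed.append("RTO")
    if result.data_loss_minutes > result.rpo_minutes:
        missed.append("RPO")
    return missed


# Hypothetical drill: recovery took 95 minutes against a 60-minute RTO.
drill = DrillResult("WAN link failure", recovery_minutes=95,
                    data_loss_minutes=10, rto_minutes=60, rpo_minutes=15)
print(drill.scenario, missed_objectives(drill))
```

A missed objective points to one of the two adjustments described above: either the procedure needs to get faster, or the target needs to be reset to something workable.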
Simple one- or two-page recipes that provide step-by-step instructions on what to do and how to do it make the most effective hands-on tools for responsible parties to use when enacting disaster recovery or business continuity plans in the field. Checklists can be a valuable component of such tools, and can help speed processes related to problem identification, resolution, and repair or recovery.
What to include in your networking DR checklist
Checklists start from an inventory of networking and systems equipment, services and applications, with a separate checklist for each item. When it comes to networking equipment, it's essential to include key infrastructure elements -- such as routers, switches, and WAN optimization devices -- in the drills, and to make sure that recovery or repair efforts produce functional networks, systems, and services. It's also important to model multiple types of failure to make sure plans and checklists address them adequately. These include carrier-level access, equipment, media and system failures.
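One way to confirm that no inventory item or failure type has been overlooked is to compare the inventory against the checklists you actually have on file. The sketch below assumes checklists are tracked as (item, failure type) pairs; the item names and that bookkeeping scheme are hypothetical.

```python
# Failure types named in the text above.
FAILURE_TYPES = ["carrier access", "equipment", "media", "system"]


def missing_checklists(inventory, checklists):
    """Return (item, failure_type) pairs that have no checklist yet."""
    return [
        (item, failure)
        for item in inventory
        for failure in FAILURE_TYPES
        if (item, failure) not in checklists
    ]


# Hypothetical inventory and the set of checklists written so far.
inventory = ["core-router-1", "wan-optimizer-1"]
checklists = {
    ("core-router-1", "carrier access"),
    ("core-router-1", "equipment"),
    ("core-router-1", "media"),
    ("core-router-1", "system"),
    ("wan-optimizer-1", "equipment"),
}

for item, failure in missing_checklists(inventory, checklists):
    print(f"No checklist for {item}: {failure} failure")
```

Not every failure type will apply to every item, so treat the output as a list of gaps to review rather than a list of documents that must all be written.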
In general, you should create a step-by-step recipe for each type of failure for each inventory item. For a WAN optimization device, for example, the recipe for outright device failure would include steps like the following:
- Run diagnostics to establish the state of the device. This includes a step-by-step series of commands or GUI actions documented to support "monkey-see, monkey-do" operations.
- For outright failure, obtain a spare and import the configuration profile. Provide information on where to find spares, how to check one out, what to disconnect from the old unit and how to reconnect the new one.
- Run diagnostics on the replacement unit to make sure it functions properly. This is a step-by-step series of commands or GUI actions, as in the first step.
- Remove the failed unit and replace with tested unit.
- Test unit to make sure key sample services work properly. Define a detailed series of checks/test operations; ideally these should be supported by automated test scripts or step-by-step command or GUI level instructions.
- If unit passes tests, report successful replacement and restoration of service; if unit fails tests, return to the first step.
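The control flow of a recipe like this, including the "return to the first step" rule, can be sketched as a small runner. This is only an illustration: the step actions are stand-in functions, and the `max_attempts` cap is an added assumption so that a persistently failing unit gets escalated rather than retried forever.

```python
def run_recipe(steps, max_attempts=3):
    """Run checklist steps in order; restart from the first step if one fails.

    steps is a list of (description, action) pairs, where each action
    returns True on success. Returns (succeeded, log), with log holding
    one (attempt, description, ok) tuple per step executed.
    """
    log = []
    for attempt in range(1, max_attempts + 1):
        for description, action in steps:
            ok = action()
            log.append((attempt, description, ok))
            if not ok:
                break  # go back to the first step on the next attempt
        else:
            return True, log  # every step passed
    return False, log  # assumed policy: escalate after max_attempts


# Stand-in actions: the service test fails once, then passes.
test_results = iter([False, True])
steps = [
    ("Run diagnostics on replacement unit", lambda: True),
    ("Swap failed unit for tested unit", lambda: True),
    ("Test key sample services", lambda: next(test_results)),
]

succeeded, log = run_recipe(steps)
for attempt, description, ok in log:
    print(attempt, description, "ok" if ok else "FAILED")
```

The log doubles as a record of what happened at each step, which is the kind of detail a post-drill review can draw on.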
As your staff members work through practice drills, they'll be interacting with, and reacting to, these recipes at each step along the way. Encourage them to take notes and ask questions about anything they see and don't understand, instructions that don't work as described, or activities that don't make sense. You can use this information for post-drill audits, and adjust or replace your recipes and checklists to keep them relevant and usable over time.
About this author: Ed Tittel is a long-time freelance writer and trainer who specializes in topics related to networking, information security, and markup languages. He writes for numerous TechTarget.com Web sites, and recently finished the 4th edition of The CISSP Study Guide for Sybex/Wiley (ISBN-13: 978-0470276886).