Over the past several decades, thousands of technology disaster recovery plans have been written. Many have been exercised; subjected to gap analyses, disaster recovery plan audits and other assessments; and revised to improve their functionality. But how do you know your plan will perform as expected during a disruptive event? The only true way to see if the plan really works is to experience a live disaster and see if the plan successfully mitigates the severity of the event and minimizes the disruptive effect on the organization.
In the meantime, assuming no real incidents occur, how do you know if your DR plan will perform as needed? One way is to conduct exercises that check whether the procedures documented in the plan work as intended. It's certainly not advisable to abruptly shut down mission-critical applications to test the DR plan during regular business hours. Instead, such shutdowns can be performed in nonproduction processing environments or outside normal business hours.
Another technique is to use a disaster recovery plan checklist of activities -- i.e., controls -- that must be part of a DR plan and its supporting activities. Although this approach is certainly no substitute for a live exercise in which systems, networks or other resources are disabled, it's a good starting point to ensure that all or most of the plan development issues are addressed.
Many books, technical articles and papers have been written about preparing extensive disaster recovery plans. Software tools have been developed to facilitate the process. Do-it-yourself plan templates and aids are available on the internet using standard search engines. And a growing number of managed service providers today offer disaster recovery as a service, which goes beyond data backup and recovery to developing and exercising DR plans.
The following sections discuss the issues associated with preparing a comprehensive disaster recovery plan. As you'll see, dozens of controls have been defined over the years to identify weak points in DR plans. A key reason for having DR plans is risk mitigation: identifying potential IT risks, threats and vulnerabilities, and determining how to prevent them or mitigate their effects if they occur.
Evaluate the disaster recovery plan
Certain fundamental elements should be in the plan, and auditors or others reviewing the plan will check that they are present. Among the key audit controls are a stated policy; step-by-step procedures for recovering disrupted systems and networks; contact lists of emergency team members, vendors, senior management, first responders and other stakeholders; procedures for assessing the incident and determining how to respond; procedures for communicating the incident to employees, senior management and others; and procedures for ensuring the safety and security of employees.
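As a rough illustration, the key audit controls above can be treated as a checklist and compared against an inventory of what a plan actually documents. The sketch below is a minimal example under stated assumptions: the control names are hypothetical labels paraphrased from the list above, and the example plan inventory is made-up data, not taken from any real plan or audit standard.

```python
# Hypothetical control names, paraphrased from the key audit controls in the text.
REQUIRED_CONTROLS = [
    "policy",
    "recovery_procedures",
    "contact_lists",
    "incident_assessment",
    "incident_communication",
    "employee_safety",
]


def missing_controls(plan_sections):
    """Return the required controls that the plan does not document.

    plan_sections maps a control name to True (documented) or False.
    Controls absent from the mapping are treated as missing.
    """
    present = {name for name, documented in plan_sections.items() if documented}
    return [control for control in REQUIRED_CONTROLS if control not in present]


# Illustrative inventory of a plan under review (assumed data).
example_plan = {
    "policy": False,
    "recovery_procedures": True,
    "contact_lists": True,
    "incident_assessment": True,
    "incident_communication": False,
    "employee_safety": True,
}

gaps = missing_controls(example_plan)
print(gaps)  # ['policy', 'incident_communication']
```

A check like this only confirms that a section exists. It says nothing about whether the procedures in that section actually work, which is what exercises and tests are for.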
General disaster recovery plan checklist components
Additional areas for consideration include training for DR team members; vetting team candidates to ensure they are qualified; identification of roles and responsibilities for team members and senior management in an emergency; use of a program management process; identification of mission-critical systems, applications and network resources; priorities for system recovery (recovery time objective); availability of the most current data (recovery point objective); and procedures for dealing with employee concerns, such as notifying family members of the incident.
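To make the two objectives concrete, the following sketch orders systems for recovery by recovery time objective (RTO) and flags any system whose backup schedule cannot meet its recovery point objective (RPO). The system names, durations and the `System` class are illustrative assumptions. The check also assumes, as a simplification, that worst-case data loss is roughly equal to the interval between backups.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class System:
    name: str
    rto: timedelta              # maximum tolerable time to restore the system
    rpo: timedelta              # maximum tolerable data loss, expressed as time
    backup_interval: timedelta  # how often backups actually run


def recovery_order(systems):
    """Sort systems so that the shortest RTO is recovered first."""
    return sorted(systems, key=lambda s: s.rto)


def rpo_violations(systems):
    """Return names of systems whose backup interval exceeds their RPO."""
    return [s.name for s in systems if s.backup_interval > s.rpo]


# Illustrative inventory (assumed values, not recommendations).
systems = [
    System("payroll", timedelta(hours=24), timedelta(hours=12), timedelta(hours=24)),
    System("order-entry", timedelta(hours=1), timedelta(minutes=15), timedelta(minutes=5)),
    System("email", timedelta(hours=4), timedelta(hours=1), timedelta(hours=1)),
]

print([s.name for s in recovery_order(systems)])  # ['order-entry', 'email', 'payroll']
print(rpo_violations(systems))                    # ['payroll']
```

In this made-up inventory, payroll is backed up once a day but the business tolerates only 12 hours of data loss, so the backup schedule, not the DR procedure, is the gap.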
From an audit perspective, policies and procedures are key controls for examination. Regrettably, many DR plans have no policy associated with them. Certain organizations, such as financial institutions and energy and utility companies, must comply with extensive regulations at federal, state and even local levels. A well-written policy can clearly spell out the company's intention to be compliant with specific regulations. Policies may also spell out performance expectations, such as performing up to specific key performance indicators.
Business impact analyses and risk assessments
When preparing a DR plan, the planning team needs data on the critical business processes; technologies that support those processes; risks, threats and vulnerabilities to those processes; the people resources needed; and the timeframes specified by the business that must be met when recovering those systems and processes. The activities that gather this important data are business impact analyses (BIA) and risk assessments (RA). Lack of data from these activities means the DR plans might recover the wrong systems, ignore the true threats to the organization and recover the wrong data in the wrong priority and timeframe. Such a failure could result in a major business disruption or possibly a total loss of the business.
Disaster recovery strategies
BIA and RA outputs help frame the organization's IT disaster recovery strategies. These can include using backups to protect systems and data, building resilient network infrastructures to minimize single points of failure, designating backup employees for every DR team member, completing a skills matrix to ensure that critical skills will be available following an event and establishing an inventory of spare components, such as servers, routers and cabling. Strategies are essential ingredients for preparing DR plans, because the plans define the step-by-step procedures for carrying out those strategies.
Critical IT disaster recovery plan checklist components
Additional IT DR plan elements include a decision process for launching DR plans; guidelines for dealing with the media; guidance on communicating with vendors and other stakeholders; guidance on obtaining emergency funding for the purchase of equipment; guidance on setting up a command center to manage the disaster; guidance on setting up alternate work areas; and procedures for resumption of IT operations and key business processes.
Other key considerations
Some additional items include awareness and training of employees and their roles in an IT disruption; documentation of all relevant plan components; engagement of other departments -- e.g., human resources, facilities management -- in the preparation of the DR plan; and use of corporate change management and project management processes.
Test and update the disaster recovery plan
As previously noted, the best way to evaluate a disaster recovery plan is to have a live disaster. The next best option is to conduct exercises of the DR plan and tests of recovery procedures for individual systems and network resources. Exercises typically validate that the plan's overall approach to recovering technology, and to limiting the effect on the business, works as intended.
Tests typically look for a pass-fail result. Specifically, if the test shows that a disabled system can't be recovered, restarted and used in a normal manner using the procedures in the plan, that plan fails. If the recovery and restart procedures result in a properly functioning application, the plan passes. This is very important because procedures can be written incorrectly, contain spelling errors or list steps in the wrong sequence. Any of these situations can leave a system unrecoverable.
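The pass-fail idea, and the effect of steps listed in the wrong sequence, can be sketched in a few lines. Everything here is a hypothetical stand-in: `run_recovery_test`, `make_steps` and the two step names are invented for illustration, and the steps only toggle an in-memory flag rather than touching a real system.

```python
def run_recovery_test(steps):
    """Run recovery steps in order and stop at the first step that fails.

    steps is a list of (name, action) pairs; each action returns True on success.
    """
    for name, action in steps:
        try:
            succeeded = action()
        except Exception as exc:  # a step that raises also counts as a failure
            return {"passed": False, "failed_step": name, "detail": str(exc)}
        if not succeeded:
            return {"passed": False, "failed_step": name, "detail": "step returned False"}
    return {"passed": True, "failed_step": None, "detail": ""}


def make_steps(order):
    """Build stand-in recovery steps in the given order, with fresh state."""
    state = {"restored": False}

    def restore_backup():
        state["restored"] = True
        return True

    def restart_application():
        # The application can only start once its data has been restored.
        return state["restored"]

    available = {
        "restore_backup": restore_backup,
        "restart_application": restart_application,
    }
    return [(name, available[name]) for name in order]


correct = run_recovery_test(make_steps(["restore_backup", "restart_application"]))
wrong_order = run_recovery_test(make_steps(["restart_application", "restore_backup"]))

print(correct["passed"])           # True
print(wrong_order["passed"])       # False
print(wrong_order["failed_step"])  # restart_application
```

The same two steps pass in one order and fail in the other, which is why the written sequence of a procedure needs to be tested and not just reviewed.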
After a test and/or exercise, conduct a post-event discussion of what happened and what must be done. After that discussion, prepare an after-action report that summarizes the test/exercise, what worked and what didn't, plus lessons learned.
Most of today's disaster recovery guidance includes a well-defined framework for DR programs and plans. Following these disaster recovery plan checklist items is a good start, but it doesn't mean that your plan will save your business when a disaster occurs. To ensure that you have covered the relevant bases, consistently look for ways to improve your disaster recovery plan.