The risk assessment process is an important initial step in disaster recovery planning, but understanding risk alone is of little use without also understanding impact. To use a simple analogy, the risk of losing a coin toss is fairly high (50%), but the impact is negligible if the toss only decides who starts a friendly baseball game. When developing an IT disaster recovery plan, two important questions must be answered before considering what constitutes the most efficient and cost-effective recovery strategy.
These questions can be summarized as:
1. What disaster scenarios could realistically happen to our IT environment?
2. How can a given disaster scenario affect the business supported by that IT environment?
These questions represent the need to evaluate the risk facing the IT environment and determine the consequences of that risk, as well as the effect it would have on the organization.
What is a risk?
There are many ways to define what constitutes a risk from an IT perspective. One of the generally accepted definitions is “the existence of an exposure to a known threat, combined with the probability of occurrence.”
This is where it becomes important not to get too hung up on terminology and to keep things simple if we want to get some work done. For example, the exposure could be the absence of redundant server hardware, the known threat could be hardware failure, and the probability of occurrence is fairly high. Together, these constitute a risk.
It is important to identify risks early in the DR planning process, but it is equally important not to begin DR strategy development each time a risk is identified.
It is all too easy to identify hard disk failure as a risk and then immediately cross it out on the grounds that a data backup is in place. The truth is that the risk still exists; controls have simply been put in place to reduce its impact.
It is considered best practice to first assess each risk and analyze its possible impact on the business, and then evaluate existing controls that would reduce or eliminate that particular risk’s effect if it occurs. The implementation of additional or missing controls should come after that process.
While evaluating the probability of a risk is important, the evaluation only needs to be a reasonable estimate of how likely the event is. Acknowledging that an unplanned outage could eventually affect your IT environment, for example, matters more than trying to work out the exact odds.
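The assess-first, controls-second ordering described in this section can be sketched as a simple risk register. This is only an illustration, not a prescribed format: the `Risk` and `Level` names, the three-step scale and the sample entries are all assumptions made for the example. Likelihood is kept deliberately coarse, and a control lowers the recorded impact without removing the entry.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class Level(IntEnum):
    """Coarse three-step scale (an assumption); not a calculated probability."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Risk:
    exposure: str        # the weakness, e.g. no redundant server hardware
    threat: str          # the known threat, e.g. hardware failure
    likelihood: Level    # reasonable estimate of how likely it is to happen
    impact: Level        # business impact with no controls counted
    controls: List[str] = field(default_factory=list)
    residual_impact: Optional[Level] = None  # impact once controls are counted

    def effective_impact(self) -> Level:
        # Controls reduce the impact; the risk itself stays in the register.
        if self.residual_impact is not None:
            return self.residual_impact
        return self.impact


# Illustrative entries only; the levels are not taken from any real assessment.
register = [
    Risk("No redundant server hardware", "Hardware failure",
         Level.HIGH, Level.HIGH),
    Risk("Single disk per server", "Hard disk failure",
         Level.HIGH, Level.HIGH,
         controls=["Data backup"], residual_impact=Level.MEDIUM),
]

# Risks with no existing control are the candidates for additional controls,
# which are decided only after every risk has been listed and assessed.
uncontrolled = [r for r in register if not r.controls]
```

Here the disk failure entry remains in the register even though a backup exists; only its effective impact changes.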
What to look for
Another important aspect of the risk assessment process is to avoid getting too granular by trying to list all possible threats and exposures to an organization. Reasonable probability of occurrence must also prevail. Trying to plan for a disaster caused by an aircraft striking your data center or a solar flare frying processors in your servers should probably be grouped under broader categories such as “site-wide disasters” and “hardware failure.”
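One way to keep the list from becoming too granular is to map each specific threat to a broad category and plan at the category level. The sketch below assumes a hand-maintained mapping; the `CATEGORY_OF` table and the `group_threats` helper are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from granular threats to broad planning categories.
CATEGORY_OF = {
    "aircraft strikes data center": "site-wide disaster",
    "hurricane damages server room": "site-wide disaster",
    "solar flare fries processors": "hardware failure",
    "hard disk failure": "hardware failure",
}


def group_threats(threats):
    """Collapse granular threats into broad categories."""
    grouped = defaultdict(list)
    for threat in threats:
        # Unmapped threats are kept under "uncategorized" for later review
        # rather than being silently dropped.
        grouped[CATEGORY_OF.get(threat, "uncategorized")].append(threat)
    return dict(grouped)


grouped = group_threats(CATEGORY_OF)
```

The recovery plan then needs one scenario per category rather than one per individual threat.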
You also must refrain from listing solutions when identifying threats. It is advisable to list all reasonable risk factors first, and then review existing controls or measures in place to ensure they are adequate. For example, the risk of a hurricane damaging your server room should not be crossed out because you have a failover site. The risk of a hurricane is still the same, and you must consider how to minimize the potential for damage.
An IT environment could be exposed to several high-level potential threats, including:
- A lack of redundant data center critical infrastructure. Examples include a single UPS or power distribution path, no backup generator, a cooling system with single points of failure or inadequate fire suppression.
- Geographic and climate-related threats. No matter how redundant the data center and the IT infrastructure are, the entire facility can be the single point of failure if it is affected by a climatic event.
- A lack of redundant IT infrastructure components or the existence of single points of failure. This is a broad area that ranges from high-level items, such as a single network link or a single server for critical apps, to granular details, such as servers with single power supplies.
- Weak physical and logical security. Examples include unlocked doors, weak intrusion prevention or detection, and inadequate security precautions for visitors.
- Inconsistent data backup process. Frequently failed backups, poor reporting or monitoring, absence of offsite copies of backups or inconsistent transport of backups offsite.
- Undefined recovery time objectives or recovery point objectives. This can lead to the false assumption that the data backup schedule, frequency or even method is adequate when in fact it is not.
- A poor change management process. The absence of adequate change control is a frequent cause of unplanned system outages and data loss resulting from human error or a lack of planning or testing.
- Absence of configuration documentation. Heavy reliance on the availability of knowledgeable IT staff is a poor substitute for basic systems documentation.
- Absence of a disaster recovery plan. A lack of a disaster recovery plan should always be documented during the risk assessment process.
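The point about undefined recovery point objectives in the list above can be made concrete with a small check. This is a minimal sketch under a simplifying assumption: the worst-case data loss is taken to equal the time between two successful backups. The `backup_meets_rpo` function and the figures used are hypothetical.

```python
from datetime import timedelta


def backup_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Return True if the backup schedule can satisfy the RPO.

    Assumes worst-case data loss equals the interval between two
    successful backups; failed backups make the real exposure larger.
    """
    return backup_interval <= rpo


# Hypothetical figures: daily backups checked against two candidate RPOs.
daily = timedelta(hours=24)
print(backup_meets_rpo(daily, rpo=timedelta(hours=4)))   # False
print(backup_meets_rpo(daily, rpo=timedelta(hours=24)))  # True
```

Without a defined RPO there is nothing to compare the schedule against, which is how a daily backup can be assumed adequate when the business can only tolerate a few hours of lost data.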