Disaster recovery (DR) planning is fundamentally the result of, or an action driven by, risk management. The term "risk" can mean different things to different people, depending on the context in which it is used. For example, to insurance underwriters, everything is a risk, and premiums are calculated based on the probability that the risk will materialize. For others, a risk is simply something worth avoiding or, at the very least, mitigating.
It is therefore a good idea to establish exactly what risk means to your business. A somewhat scientific definition of risk combines the probability that a threat will exploit a weakness or vulnerability with the consequences if it does. Translated into an IT context, this could mean the chance that a hardware failure will cause the data stored on your RAID 0 disk array to be lost.
While we have clearly expressed probability, threat, vulnerability and consequences in this example, we have not defined the potential impact. Without impact, a risk can be meaningless. Based on our example above, if you had a backup copy of that data on tape, the impact of having lost the original on disk has just been significantly reduced, which is a form of risk management.
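One common way to put numbers on this idea is to treat risk as probability multiplied by impact. The sketch below applies that simplification to the RAID 0 example. The function name and every figure in it are assumptions made up for illustration; it is not a method prescribed here, only a way to see how a tape backup lowers the impact while leaving the probability of failure unchanged.

```python
# Illustrative sketch only: the probability and cost figures are assumed,
# not taken from the article.

def expected_loss(probability: float, impact: float) -> float:
    """Expected loss over a period = chance of the event x cost if it happens."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * impact

# Hypothetical RAID 0 array: no redundancy, so one disk failure loses the data.
annual_failure_probability = 0.05    # assumed
impact_without_backup = 200_000.0    # assumed cost of losing the data outright
impact_with_tape_backup = 10_000.0   # assumed cost of restoring from tape

unmitigated = expected_loss(annual_failure_probability, impact_without_backup)
mitigated = expected_loss(annual_failure_probability, impact_with_tape_backup)

print(f"Expected annual loss without backup:   {unmitigated:,.0f}")
print(f"Expected annual loss with tape backup: {mitigated:,.0f}")
```

The disk is just as likely to fail in both cases; only the consequence of the failure has changed.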
Most companies have some form of enterprise risk management program in place to address the risks inherent to running a business. Considerations are given to market demand, competition, the state of the economy, etc. for guidance in the business decision-making process. That said, many companies tend to overlook their internal processes or resilience when evaluating risk. In almost all cases today, the ability to provide goods and services depends on the availability of data and the systems to process it. For the purpose of this discussion, let's focus on understanding some of the risks associated with data and storage.
Disk storage-related risk
|Snapshots and mirror copies||When a database becomes corrupted or parts of it are accidentally deleted, it can take some time before the corruption is detected. All snapshot or point-in-time copies can be affected if data consistency checks are not performed. The vulnerability here is not so much the storage technology, but the possible reliance on it as the sole source of data protection.|
|Human error||If the production copy of a database becomes corrupted or unusable, there needs to be a mechanism in place to prevent overwriting the replicated copy with the bad copy by mistake.|
|Hardware failure (or SPOF)||Single points of failure (SPOF) must be identified from the host all the way to the allocated storage.|
|Insufficient security in a SAN environment||Many storage experts agree that storage access to systems and applications should be controlled at the HBA, SAN switch and disk controller level to avoid volume contention between hosts.|
|Poorly documented configuration and recovery procedures||This creates an exposure if knowledgeable staff are unavailable following a major outage or disaster.|
|Lack of segregation of duties||Too many IT personnel with unrestricted access to storage configuration interfaces or utilities can lead to inadvertent or unauthorized changes causing data loss.|
|Poor or nonexistent change management policies||A lack of change management is probably one of the most common vulnerabilities, but it is often overlooked because IT personnel typically don't consider themselves threat agents. However, poorly planned changes are frequently identified as the cause of storage failure or data loss.|
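The single point of failure entry in the table above lends itself to a small check. The sketch below assumes you can list every I/O path from a host to its allocated storage; any component that appears on all of them is a candidate SPOF. The topology and component names are hypothetical, and the host and volume themselves are excluded only to keep the output focused on the fabric in between.

```python
# Hypothetical topology: every component name below is made up for illustration.
# Each path lists the components an I/O traverses from host to storage volume.

def single_points_of_failure(paths):
    """Return components present on every path (losing one breaks all paths)."""
    if not paths:
        return set()
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)
    return common

paths = [
    ["host1", "hba_a", "switch_a", "controller_1", "lun_7"],
    ["host1", "hba_b", "switch_b", "controller_1", "lun_7"],
]

# The endpoints are on every path by definition, so they are reported separately.
endpoints = {"host1", "lun_7"}
spofs = single_points_of_failure(paths) - endpoints
print(sorted(spofs))
```

In this made-up layout the HBAs and switches are duplicated, but both paths converge on the same disk controller, which is exactly the kind of dependency that is easy to miss.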
Backup data-related risk
|Single copy backups||There is potential exposure to data loss in the event backup storage media is damaged or lost.|
|Daily backups, but weekly offsite only||This creates an exposure to as much as one week's worth of data loss if the production data and backup copy are destroyed.|
|Insufficient backup frequency||Recovering data to its latest state is now a common requirement in many environments. Daily backups may be insufficient and can create an exposure to partial data loss.|
|Backups exceeding available window||This can impose backup schedules that do not meet business requirements. For example, a full backup is only run on weekends because it exceeds 24 hours and interferes with production.|
|Unencrypted data on offsite-bound media||This can create a security or compliance exposure, depending on the industry.|
|Inconsistent data state backup||This can happen when backup software is not "application aware," causing data to be captured and sent to backup media in an inconsistent state.|
|The offsite vault is the trunk of your car||Hopefully, this exposure requires no explanation. Poor environmental storage conditions can seriously compromise backup media.|
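Two entries in the table above come down to simple arithmetic: how much data is exposed for a given backup and offsite schedule, and whether a backup fits its window. The sketch below works through both. The function names, data size, throughput and window length are assumptions for illustration, and the model ignores the time a backup takes to run and any delay in shipping media offsite.

```python
# Sketch under assumed figures; the sizes and rates are not from the article.

def worst_case_loss_hours(backup_interval_h, offsite_interval_h, site_lost):
    """Upper bound on how old the newest surviving backup copy can be."""
    # If only production data is lost, the newest onsite backup is usable.
    # If production and onsite backups are both destroyed, only the newest
    # offsite copy survives.
    return offsite_interval_h if site_lost else backup_interval_h

def fits_window(data_tb, throughput_tb_per_hour, window_hours):
    """Return (duration in hours, whether the backup fits the window)."""
    duration = data_tb / throughput_tb_per_hour
    return duration, duration <= window_hours

# Daily backups, sent offsite once a week.
disk_failure = worst_case_loss_hours(24, 7 * 24, site_lost=False)
site_disaster = worst_case_loss_hours(24, 7 * 24, site_lost=True)
print(f"Production copy lost: up to {disk_failure} hours of data")
print(f"Whole site lost: up to {site_disaster / 24:.0f} days of data")

# Hypothetical 30 TB full backup at 1 TB/hour against an 8-hour nightly window.
duration, ok = fits_window(30, 1.0, 8)
print(f"Full backup takes {duration:.0f} hours; fits nightly window: {ok}")
```

A full backup that cannot fit the nightly window ends up pushed to the weekend, which is the situation the table describes.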
Obviously, the list of threats, exposures and risks to which storage can be subjected could be much longer and include many external or indirect elements that also threaten the entire IT environment. Still, this list offers a good starting point.
Likewise, measuring the probability of occurrence and the resulting impact can be difficult, but both must at least be estimated when developing a risk mitigation or disaster recovery plan. Some risk can be tolerated if the impact is minimal or the probability of occurrence is very remote. While it is not practical, cost-effective or even possible to systematically eliminate all risk, efforts must first focus on high-impact risks that have a higher probability of occurrence.
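The prioritization described above can be sketched as a simple triage over a risk register. The thresholds, probabilities and costs below are invented for illustration, and ranking by probability times impact is one possible ordering rather than a prescribed one; the risk names are borrowed from the tables earlier in this tip.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    probability: float  # estimated chance of occurrence per year, 0..1
    impact: float       # estimated cost if the risk materializes

def triage(risks, min_probability, min_impact):
    """Split risks into those to address first and those that may be tolerated."""
    address, tolerate = [], []
    for risk in risks:
        # Tolerable if the probability is very remote or the impact is minimal.
        if risk.probability < min_probability or risk.impact < min_impact:
            tolerate.append(risk)
        else:
            address.append(risk)
    # Highest estimated exposure first.
    address.sort(key=lambda r: r.probability * r.impact, reverse=True)
    return address, tolerate

# All figures are assumed estimates, not measurements.
risks = [
    Risk("Hardware failure (SPOF)", 0.05, 200_000),
    Risk("Poor change management", 0.30, 80_000),
    Risk("Single copy backups", 0.001, 500_000),
    Risk("Backups exceeding available window", 0.50, 2_000),
]

address, tolerate = triage(risks, min_probability=0.01, min_impact=10_000)
for risk in address:
    print(f"Address first: {risk.name}")
for risk in tolerate:
    print(f"May be tolerated: {risk.name}")
```

Even rough estimates are enough to produce an ordering, which is usually all that is needed to decide where to start.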
However, as is also true for disaster recovery strategies, always remember that the cost of risk mitigation should never exceed the losses it is designed to prevent.
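That rule of thumb can be expressed as a one-line comparison. In the sketch below, the function name and all of the cost figures are assumptions for illustration; "loss" here means an estimated annual loss before and after the mitigation is in place.

```python
# Sketch with assumed figures: compares what a mitigation costs per year
# with the estimated annual loss it prevents.

def mitigation_is_justified(annual_cost, loss_before, loss_after):
    """True when the mitigation costs no more than the loss it prevents."""
    prevented = loss_before - loss_after
    return annual_cost <= prevented

# Hypothetical tape backup: 3,000 per year to cut estimated loss from 10,000 to 500.
tape_backup = mitigation_is_justified(3_000, 10_000, 500)

# Hypothetical replication: 40,000 per year to cut estimated loss from 10,000 to 100.
replication = mitigation_is_justified(40_000, 10_000, 100)

print(f"Tape backup justified: {tape_backup}")
print(f"Replication justified: {replication}")
```

The second option reduces the loss further but costs several times what it saves, so on these made-up numbers it would not pass the test.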
Pierre Dorion is the data center practice director and a senior consultant with Long View Systems Inc. in Phoenix, Ariz., specializing in the areas of business continuity and DR planning services and corporate data protection.