An essential recovery capability covers six straightforward things: a place to go, systems to recover on, data to recover with, a network to connect to, procedures to follow to restore systems, and the people to carry it out. If one of those basic elements is out of place, you don't have a recovery capability.

Is there such a thing as a foolproof disaster recovery plan?
No. Your company can perform due diligence and plan for what's appropriate based on its risk profile. But little in life is foolproof. It's virtually impossible to nail down everything that can happen and plan for each and every scenario. You protect for what's appropriate for your business based on your risk threshold.

How do you determine a reasonable plan of action for your company?
There are a couple of different things to consider. There are frequently occurring, low-impact events like a short-term network outage. And then there are low-probability, high-impact events, like a plane falling out of the sky onto your data center or a nuclear attack. The high-impact events probably aren't going to happen, but how likely they are depends on where you are. For example, if your data center is in Houston, down near the Gulf, and it's in the basement, the likelihood of a flood is actually very high.

How geographically dispersed should companies place their backup facilities?
You can recover your technology closer to your business than you used to. I think recovering your San Francisco production data center in Philadelphia is useless. It's a couple of thousand miles away, plus you have to transport your technical staff to the site. What we recommend is to recover your technology within the region, but outside the potential area of impact.

Can you plan for something like 9/11?
You can plan for complete loss of the data center. That doesn't necessarily mean a plane flying into the building. It could mean a flood where all the equipment is ruined. It could mean a fire. Here's a wild example: I had a customer that built a $60 million data center in Texas. They built it on a landfill, and left a gap between the ground and the data center floor. After two years, they found this mold growing in the space that would have destroyed the data center, if they hadn't caught it in time. There's no way of really cleaning systems and network fiber. They would have just had to punt and move someplace else. You can't really anticipate that event.

What are the different types of backup products and plans?
There's a host of different types of solutions on the market today, dealing with mirroring, replication, getting data off site, backing up, or shipping equipment in (as opposed to backing it up). Based on the business requirements of the individual company, we create a backup plan on three or four tiers. Tier one is applications that are absolutely vital to business, say a payroll application or a clearing application for a financial firm. The impacts are very high and the recovery times have to be very short. Tier two is doing some backup-to-tape and leveraging a co-location site for failover. Tier three is a two-to-three-day recovery window: You quick-ship in NT systems and AS/400s and network gear to a pre-designated site. And tier four is punt: in other words, these applications are non-essential, and we can live without them.

How can a systems administrator/IT manager convince higher-ups to invest as much as necessary?
Disaster recovery is a business problem, but the burden falls on the IT manager. The first thing the IT manager has to understand is, what are the needs of the business? How does their company generate revenue, and what are the impacts of lack of access to the technology or the facility? Often, it's very hard for the IT manager to do an effective impact analysis because they don't have the proper methodology, process or tools. Nor the time, because they're running around with their hair on fire tending to production issues. If the IT manager wants to move this forward, they need to understand what the business requirements are and translate that into business terms for the CFO and the CIO.

What are companies not doing, that they absolutely should be doing?
They're not testing enough. They'll put together a backup capability and not have an effective test program. Sometimes it's logistically hard to carry out a test, especially if you're talking about a fairly complex environment. A company might get an audit comment that says, okay, you're backing up your data, you're taking your tapes off-site and you've got a place to go in case of disaster, but you've never tested. The new regulations, for example the FDIC examiner's handbook, focus heavily on testing. If you can't pass that audit exam, that's a huge issue.
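
As a rough illustration of what keeping track of a test program might look like, the sketch below flags recovery plans that have never been tested or were last tested too long ago. Everything in it is an assumption made for the example: the `RecoveryPlan` record, the `needs_test` helper, the application names and dates, and the 365-day threshold. None of these come from the interview or from any regulation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RecoveryPlan:
    """Hypothetical record of one application's recovery plan."""
    application: str
    last_tested: Optional[date]  # None means the plan has never been tested


def needs_test(plan: RecoveryPlan, today: date, max_age_days: int = 365) -> bool:
    """Return True if the plan was never tested or its last test is too old.

    The 365-day default is an illustrative assumption, not a regulatory figure.
    """
    if plan.last_tested is None:
        return True
    return (today - plan.last_tested).days > max_age_days


# Made-up sample data for illustration only.
plans = [
    RecoveryPlan("payroll", date(2024, 1, 15)),
    RecoveryPlan("clearing", None),
    RecoveryPlan("intranet wiki", date(2022, 6, 1)),
]

today = date(2024, 6, 1)
overdue = [p.application for p in plans if needs_test(p, today)]
print(overdue)
```

With this sample data, the never-tested plan and the one last exercised two years earlier are the ones flagged. A real program would set the interval from its own audit and business requirements.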