Steven Ross, executive principal and founder of Risk Masters Inc., recently spoke with SearchDisasterRecovery.com Assistant Editor John Hilliard about some of the best practices for preparing for and responding to a disaster in the data center. Learn data center disaster recovery best practices, including what needs to be in your data center disaster recovery plan to ensure a seamless data center recovery. Listen to the podcast or read the transcript below.
What can you do to ensure an organized recovery of computer systems and other essential equipment? Should there be careful documentation of everything in the data center? Also, who is responsible for each piece of equipment, power requirements, etc.?
Let me start by pushing back on the question… the issue isn’t, “What do you do to recover it?” The question is: What do you do to build it in such a way that these problems don’t become problems? The underlying answer to your question is to build resilient data centers, and build resilient data centers on an enterprise level. This is more applicable to larger corporations and government agencies. But anyone who says, “I have a data center and I’m dependent on it, but I don’t have any backup for that data center,” is dead, right from the beginning. Too many organizations 15 years ago signed a contract with a hot site vendor, and they got a backup location, and they think, “Well, I’m covered.” But that’s not the point. The point is to understand how the equipment flows into the business, and the requirements for everything from complete fault tolerance across environments, down to, “I can live without it for a couple of days and recover from tape and not worry too much.”
With that said, if you want an organized data center recovery, you have to think of the data center as a business process, with lots of interdependent moving parts. And because there are a lot of interlinked parts, all of the things that need to be done if a problem occurs must be proceduralized in written documentation. It is shocking to me to see data center managers who are basically just landlords, and have no idea what is in their data center. This is very dangerous, because when everything goes wrong, all those other people start pointing to the data center management saying, “Well, you should be able to recover this. But you never told me what it does or how it’s connected,” [and] all of that needs to be done.
What steps do you need to take to prepare a data center for a disaster incident?
You definitely need a comprehensive set of procedures for the infrastructure and for the applications. Each component, or group of components, generally has support infrastructure, and generally speaking, there’s a person or group responsible for that. So the servers are going to be under the server group, or the virtualization group, or both. All of these are generally working under an infrastructure group or an operations group, but come a major disruption, there’s a dotted-line relationship with the disaster recovery management. And that kind of governance clearly needs to be spelled out: who is in charge, who makes the decision, what you do, and what sequence you do it in.
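As a rough illustration of spelling out "what you do, and what sequence you do it in," the sketch below derives a recovery order from a small inventory of components, their owning groups, and their dependencies. The component names, owners, and dependencies are hypothetical and are not taken from the interview; a real inventory would come from your own documentation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: component -> (owning group, components it depends on).
INVENTORY = {
    "power": ("facilities", []),
    "network": ("network group", ["power"]),
    "storage": ("storage group", ["power", "network"]),
    "hypervisors": ("virtualization group", ["storage"]),
    "database": ("server group", ["hypervisors"]),
    "order-entry app": ("applications group", ["database"]),
}


def recovery_sequence(inventory):
    """Return (component, owner) pairs ordered so dependencies come first."""
    # TopologicalSorter expects a mapping of node -> nodes that must precede it.
    graph = {name: deps for name, (_owner, deps) in inventory.items()}
    order = TopologicalSorter(graph).static_order()
    return [(name, inventory[name][0]) for name in order]


if __name__ == "__main__":
    for step, (name, owner) in enumerate(recovery_sequence(INVENTORY), start=1):
        print(f"{step}. restore {name} (owner: {owner})")
```

A circular dependency in the inventory raises graphlib.CycleError, which is itself a useful finding: it means the written procedures cannot be followed in any order as documented.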
Should this documentation be a part of your disaster recovery plan or does it need to be created separately?
I’m not sure there is a distinction between the DR plan and separate documentation. Clearly, every function needs to understand their role, responsibility, and the timing and procedures they’re going to carry out. Taken collectively, those are the DR plan.
What can you do to ensure that your system doesn’t fail during the recovery process?
We are seeing a tremendous interest these days in what I call “infrastructure assurance,” making certain that within an operating data center, all the resources are protected and redundant. To a degree that is surprising and shocking to the management we’re dealing with, most data centers that we’re seeing are replete with single points of failure.
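One simple way to start looking for single points of failure is to list, for each role in the data center, how many independent units can fill it. The sketch below flags any role with fewer than two. The roles and unit names are made up for illustration, and the "at least two units" threshold is an assumption, not a rule stated in the interview.

```python
# Hypothetical inventory: role -> independent units able to fill that role.
RESOURCES = {
    "utility feed": ["feed-A"],
    "UPS": ["ups-1", "ups-2"],
    "generator": ["gen-1"],
    "core switch": ["sw-1", "sw-2"],
    "storage controller": ["ctl-1", "ctl-2"],
}


def single_points_of_failure(resources, minimum=2):
    """Return the roles served by fewer than `minimum` distinct units."""
    return sorted(
        role for role, units in resources.items() if len(set(units)) < minimum
    )


if __name__ == "__main__":
    for role in single_points_of_failure(RESOURCES):
        print(f"single point of failure: {role}")
```

This only counts units. It does not check whether two "redundant" units share a hidden dependency, such as the same power feed, which is the kind of detail the written documentation needs to capture.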
What’s the best way to test a UPS system?
The easiest answer to that question is to run it off the uninterruptible power supply (UPS) and do a load test. There needs to be a real-time test. There are a whole bunch of National Fire Protection Association (NFPA) standards… it’s NFPA 110 that specifies power requirements for data centers and critical electrical areas. Data centers fall under that.
There’s a certain degree of risk, of course, and there’s also a certain degree of risk in not knowing whether the UPS can bear the load doubling down on the generators. A lot of this is going to be driven by what your vendors tell you, so involvement of the vendors in the testing is very worthwhile. You can also get involvement from the power company. But the UPS as such, as an isolated piece of equipment, is important… but you want to think about it end to end. So it’s not just UPS, it’s UPS and generators, and conduits and PDUs, and whatever you need to get the power from either the substation or the generator to the gear. A lot of this can be done and should be done as a matter of preventative maintenance anyway. You should be doing load tests on an annual [or] semiannual basis; you should be doing preventative monitoring; you should be doing infrared testing to see if there are any frayed wires… all of that comes together.
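To make the "end to end" point concrete, here is a minimal sketch that compares a measured load against the rated capacity of each element in the power path and reports the weakest link. The element names, ratings, and load figure are hypothetical; they are not taken from NFPA 110 or from any vendor, and a paper check like this does not replace a real load test.

```python
# Hypothetical power path from source to gear: (element, rated capacity in kW).
POWER_PATH = [
    ("generator", 500.0),
    ("UPS", 400.0),
    ("PDU", 300.0),
]


def headroom(path, load_kw):
    """Return (element, spare capacity in kW) for each element at this load."""
    return [(name, rating_kw - load_kw) for name, rating_kw in path]


def weakest_link(path, load_kw):
    """Return the element with the least spare capacity at this load."""
    return min(headroom(path, load_kw), key=lambda item: item[1])


if __name__ == "__main__":
    name, spare_kw = weakest_link(POWER_PATH, load_kw=250.0)
    print(f"weakest link: {name} with {spare_kw:.0f} kW to spare")
```

A negative spare capacity means that element is overloaded on paper, even if the UPS on its own looks fine.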