Disaster recovery (DR) impacts all aspects of IT, including storage, networking, data center management and data security. Pierre Dorion, data center practice director at Long View Systems Inc., answers the most common DR questions that IT professionals are asking.
Table of contents:
>>Are tape backups a thing of the past in DR?
>>How can you improve recovery time without remote data replication?
>>How far apart should production and alternate recovery sites be?
>>What types of technologies can be leveraged for DR?
>>Are backup service providers a good idea?
>>How can you reduce the overall cost of developing a DR strategy?
>>What is the difference between RPO and RTO?
Are tape backups a thing of the past in DR?
We have seen the use of tape decline rapidly over the past few years. If you had asked me that question a few years ago, I would have said no. But first, we need to understand the distinction between DR backup and typical backup.
If you're backing up data in case of data loss, such as an accidental deletion, tape is somewhat useful because it's a very low-cost medium. But when we're talking about a DR application, what you're looking for is quick recovery.
We've been seeing disk really displace tape in the DR arena, especially with deduplication coming into the picture, which is not available on tape today. Deduplication allows you to store a lot more data and potentially replicate it offsite at a much lower cost. That's making tape, from a DR perspective, a lot less attractive than it used to be. Even from a cost perspective, disk has become more and more affordable. So we're seeing a lot of backup solutions leverage disk as the media of choice for DR.
How can you improve recovery time without remote data replication?
There are a number of ways to improve your recovery time, and we need to take into consideration the disaster scenario that we're trying to protect ourselves from. A disaster may come in many shapes: a total site disaster is one thing, but a massive system failure is another type of disaster altogether.
There are several ways to do this without having to spend a lot of money on bandwidth for data replication. One of the best approaches is working towards a leaner storage footprint. We can leverage archive applications that reduce the amount of data that you're backing up.
We all know that on a given file server or even an email server, a number of items are stored that are not needed immediately. If you're looking at things like PST files, a certain amount of data is kept there but is not used on a daily basis. Sometimes you can see some pretty impressive stats when you do a file system assessment to find out what kind of data resides there and is not used on a daily basis.
There are a lot of solutions on the market that allow you to archive some of that data and move it away from your production environment, which allows you to perform much quicker backups and faster restores. Because you have a much smaller data set, whether you're restoring remotely or locally, you're also reducing your replication costs. So really, you need to look at taking some of the unused data out of the backup loop to speed up data replication and also save on bandwidth.
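As a rough illustration, the file system assessment described above can be sketched as a walk that flags files not accessed in a long time as archive candidates. This is a minimal sketch, not any particular product: the path and the 180-day threshold are hypothetical, and access times can be unreliable on volumes mounted with options like noatime.

```python
# Minimal sketch of a file system assessment: find files not accessed
# within N days, which are candidates for archiving out of the backup set.
# The root path and threshold are placeholders; st_atime may be unreliable
# on volumes mounted with noatime.

import os
import time

def stale_files(root: str, days: int = 180):
    """Yield (path, size_bytes) for files not accessed within `days` days."""
    cutoff = time.time() - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or unreadable; skip it
            if st.st_atime < cutoff:
                yield path, st.st_size

if __name__ == "__main__":
    # Hypothetical mount point; sum up how much data could be archived.
    total = sum(size for _path, size in stale_files("/srv/fileserver"))
    print(f"Archivable data: {total / 1e9:.1f} GB")
```

Running something like this before sizing a replication link often reveals how much of the backup set never changes and never gets read.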
If you look at disk-based backups today, there's another opportunity to reduce the amount of storage space your data occupies. Many of today's deduplication solutions give you the ability to replicate only the unique data blocks instead of the entire data set. So again, you're reducing your data replication costs.
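The unique-block idea can be shown with a toy sketch: split the data into blocks, hash them, and only "send" blocks the remote site hasn't seen. The fixed 4 KB block size and in-memory hash set are simplifications; real deduplication products typically use variable-size chunking and persistent indexes.

```python
# Toy sketch of block-level deduplicated replication: only blocks the
# remote side has not seen before actually cross the wire.

import hashlib

BLOCK_SIZE = 4096  # simplification: real products often use variable sizes

def blocks_to_replicate(data: bytes, remote_hashes: set) -> list:
    """Return only the blocks of `data` the remote site hasn't seen yet."""
    new_blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in remote_hashes:
            remote_hashes.add(digest)  # the remote now has this block
            new_blocks.append(block)
    return new_blocks

remote = set()
# Day 1: 8 KB of identical blocks -- only one block needs to be sent.
sent1 = blocks_to_replicate(b"A" * 8192, remote)
# Day 2: one already-known block, one new one -- only the new block is sent.
sent2 = blocks_to_replicate(b"A" * 4096 + b"B" * 4096, remote)
print(len(sent1), len(sent2))  # 1 1
```

Even in this tiny example, two days of 8 KB backups result in only two blocks crossing the wire instead of four, which is the cost reduction being described.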
Finally, we can talk about WAN acceleration. That's another element you can combine with deduplication to really reduce the amount of data you're sending across the wire, and therefore spend a lot less money on bandwidth. Of course, you need to consider the cost of the solution, but over time the bandwidth savings will typically exceed what you spend implementing it.
How far apart should production and alternate recovery sites be?
That's a question that comes up very often, and there's no single answer. The first thing we need to look at is the type of disaster that we are most threatened by, and that varies by geography.
If you're trying to protect yourself from a hurricane, the area of devastation touched by that kind of natural disaster is of course much wider than something like a tornado or a fire. A fire in your building is very localized, and technically your alternate recovery site could be across the street or a few blocks down. But in the event of a large-scale disaster, such as a hurricane that could touch an area one hundred or two hundred miles wide, you obviously want to move away from that area to protect yourself.
So it really depends on the geography and the kind of disaster you're looking at. One thing to consider, though: you may be tempted to play it safe and go as far as possible, but that introduces other problems, such as latency. If you're doing any kind of data replication, distance may actually interfere with your ability to implement synchronous replication. If you have very stringent recovery point objectives (RPOs) and need data replicated synchronously, that's not necessarily possible if the sites are too far apart.
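The latency penalty can be estimated from distance alone. A common rule of thumb, used as an assumption here, is that light travels through fiber at roughly 200,000 km/s (about two-thirds of its speed in a vacuum); since a synchronous write must be acknowledged by the remote site before completing, each write pays at least the round trip.

```python
# Rough lower bound on synchronous-replication latency from site distance.
# Assumes ~200,000 km/s signal speed in fiber and ignores switching,
# serialization and storage overhead, so real latency is always higher.

FIBER_SPEED_KM_PER_MS = 200.0  # ~200 km per millisecond in fiber

def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on the round-trip time over a fiber path of this length."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_MS

for km in (10, 100, 500, 1500):
    print(f"{km:>5} km -> at least {min_round_trip_ms(km):.2f} ms per synchronous write")
```

At a few hundred kilometers the per-write penalty is already several milliseconds, which is why very distant site pairs usually force a move to asynchronous replication.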
Another aspect to consider is access to your alternate recovery site. Some companies prefer to manage these things themselves, but you need to consider the implications of having to send your team to a faraway location. There is a cost associated with that, along with the complications it may bring. So you need to be aware of that when choosing your location. In fact, the SEC tried to rule on an appropriate distance, but quickly gave up because it realized there are too many variables in play.
There is no easy answer; it has to be a trade-off between cost and the type of disaster you're trying to protect yourself from.
What types of technologies can be leveraged for DR?
There are quite a few we can choose from, and a lot of them have gained in popularity. Probably the one that has had the most impact on the DR market today is server virtualization. It's very popular, and it makes a lot of sense. Virtualization gives you the ability to put together a replica of your production server infrastructure at a very low cost, because you can run multiple virtual machines, or images of operating systems and applications, on a single physical server.
You don't need to spend a whole lot of money implementing a standby infrastructure that is a one-to-one match to your production infrastructure. In fact, a lot of folks are taking that technology even further and using it in production. We now have the ability to fail over a virtual machine from one physical host to another, whether it's across the room, across town or across the country. So again, that technology will greatly reduce the cost of implementing a DR infrastructure. Add data replication to that and you have a genuine ability to fail over to a remote location.
We're now seeing the ability to virtualize just about everything, including storage and even the network, if you look at Cisco Systems Inc.'s VFrame technology. Combine all of this and you have the ability to build a good DR strategy and automate a lot of it, and automation is a big plus when it comes to DR: the fewer manual steps you have, the better off you are.
Are backup service providers a good idea?
There are a lot of good questions to ask here. This is probably something better suited to small to midsized businesses (SMBs). Not that larger companies cannot look at it, but a large volume of data becomes problematic if you have a lot of systems that you're backing up. If you're sending a lot of data across the wire, the cost of replication can pose a problem. When I say it's better suited to SMBs, it's really because of that data volume.
There are also questions you need to ask your service provider about the kind of protection they're offering. Service-level agreements (SLAs) need to match your recovery time objectives (RTOs) and your RPOs. So it's a question of evaluating your cost of ownership versus the cost of outsourcing. That's always a difficult one to pin down, but you really have to do your homework, because using a service provider can end up costing you more than running your own operation.
That said, there are other considerations. Maybe you're a company that cannot afford to hire a full-time backup specialist, yet you need that level of expertise. Then it starts to make sense to hire someone who specializes in backup to handle these things for you. And some companies we run into just don't want anything to do with running backups; they do not want the responsibility and prefer to outsource.
So it's a good idea, but you have to make sure you know who you're working with and the kind of service level they can provide. It's nice to know there is a guarantee and that if they fail to meet your objectives, they'll give you a refund. But that doesn't mean a whole lot when your company is down and you're trying to resume business but your data is unavailable.
It's kind of like talking about the warranty on an aircraft that's 30,000 feet in the air: you're not looking for the warranty papers at that point; you're looking for a parachute. So from a business perspective, if your data is in someone else's hands, you really want to make sure you will be able to recover it should something happen, and as quickly as your business needs it.
How can you reduce the overall cost of developing a DR strategy?
There is no single recipe, but the one important aspect to always take into consideration when developing a DR strategy is to make sure the plan meets your requirements. We all know that not all applications are created equal. So you need to go through a categorization exercise for your applications and clearly define their RTOs and RPOs, to avoid taking a one-size-fits-all approach to your DR strategy and applying the same model to everything.
Where it is essential to ensure high availability or very quick recovery, you will obviously want to build around that. But that doesn't mean you need to apply the same model everywhere. So a lot of planning has to go into this to make sure you're making the right decisions and building the appropriate infrastructure without overspending. You always want to ensure that the final cost of your recovery strategy does not exceed the losses you're trying to prevent.
What is the difference between RPO and RTO?
RTO is not the time it takes to recover a system, which is what some people tend to think it means. It is actually an objective: how quickly you need your systems back up and running. This is driven by potential losses and business requirements.
The business will dictate how quickly a business process needs to resume in case it is interrupted and, ultimately, how quickly the supporting infrastructure needs to be back up and running. That is your RTO.
The RPO is really a point in time. When we back up daily, we can recover to a point in time that may be up to 24 hours in the past. Again, an RPO is an objective driven by a business requirement: it expresses how much data loss is acceptable to your company.
Let's go back to daily backups as an example. If you back up at night at 6:00 p.m. and the server goes down the following day at 4:00 p.m., then you've potentially lost 22 hours of data that was created during that day. If you have no ability to recreate that data, then the data is lost.
From a business perspective, your RPO may dictate that you need data down to the last transaction. For example, if you're processing credit card transactions, you cannot afford to lose any. Your RPO then becomes zero, meaning you cannot afford any data loss.
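The daily backup arithmetic above can be written out explicitly. The dates below are arbitrary placeholders; what matters is the interval between the last backup and the failure.

```python
# Worked version of the daily backup example: how much data is at risk
# when a server fails between backups. The specific dates are arbitrary.

from datetime import datetime, timedelta

last_backup = datetime(2024, 1, 1, 18, 0)  # nightly backup at 6:00 p.m.
failure = datetime(2024, 1, 2, 16, 0)      # failure at 4:00 p.m. the next day

# Everything written since the last backup is potentially lost.
data_loss_window = failure - last_backup
print(data_loss_window)  # 22:00:00 -- up to 22 hours of data

# With daily backups, the worst case (failure just before the next backup)
# approaches the full 24-hour interval; that interval is effectively your RPO.
rpo = timedelta(hours=24)
assert data_loss_window <= rpo
```

Tightening the RPO, say to one hour or to zero, shrinks that window, but only by changing the technology: more frequent backups, log shipping, or synchronous replication.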
Indirectly, that also dictates the kind of technology you need to put in place to achieve your RPO. This is very different from the RTO, although the RTO will also dictate the kind of technology you need. The RTO is more about a maximum tolerable outage. So the two work hand in hand in defining what we need to put in place to meet our objectives: one says how quickly things need to be recovered, the other to what point in time.
Pierre Dorion is the data center practice director and a senior consultant with Long View Systems Inc. in Phoenix, Ariz., specializing in the areas of business continuity and DR planning services and corporate data protection.