Disaster recovery (DR) impacts all aspects of IT, including storage, networking, data center management and data security. In this tutorial, Pierre Dorion, DR expert and senior business continuity consultant, answers the most common DR questions that IT professionals are asking. His answers are also available as an MP3.
>>Are tape backups a thing of the past in DR?
>>Are backups and archives essentially the same thing?
>>How can I improve recovery time without remote data replication?
>>How do I establish recovery priority for my applications?
>>How do I identify what data to replicate and what data to back up?
>>How far apart should production and alternate recovery sites be?
>>Should I back up the operating system files?
>>What do tiered storage and ILM have to do with DR?
>>What backup vaulting frequency best practices exist?
>>What is the difference between disk backup and data replication?
>>What is the difference between RPO and RTO?
>>What is the most important aspect of data protection in DR?
It is very interesting how quickly things change in our industry -- my answer to this question a year ago highlighted replication or mirroring for highly critical data, disk-to-disk backups for middle-tier data and finally, low-cost tape backups for tier 3 backups and archives. The main driver behind this rationale was the cost of tape backup media, which was still the cheapest media. At the same time, a newer technology was surfacing and slowly getting backup administrators' attention: data deduplication for disk-based backups.
When we look at the major and rapid advances in the field of deduplication (companies like Data Domain are now on the fourth generation of their product), tape has lost a lot of price advantage. With data reduction ratios averaging 20:1 in some cases, deduplication technology has reduced the cost of backing up to low cost disk by the same order of magnitude -- making it that much more competitive when compared to tape.
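The savings come from storing each unique data segment only once across many backup images. As a rough sketch of the idea (fixed-size chunking is a simplification for illustration; commercial products typically use variable-size chunking, and the file names and numbers below are invented):

```python
import hashlib

CHUNK = 4096  # illustrative fixed chunk size in bytes

def dedup_ratio(backups):
    """Return total chunks / unique chunks across a set of backup images."""
    seen, total = set(), 0
    for image in backups:
        for i in range(0, len(image), CHUNK):
            total += 1
            seen.add(hashlib.sha256(image[i:i + CHUNK]).digest())
    return total / len(seen)

# 20 nightly fulls: 95 chunks never change, 5 new chunks appear each night.
static = b"".join(i.to_bytes(2, "big") * (CHUNK // 2) for i in range(95))
nights = [
    static + b"".join(bytes([n, c]) * (CHUNK // 2) for c in range(5))
    for n in range(20)
]
print(f"{dedup_ratio(nights):.1f}:1")  # roughly 10:1 for this toy data set
```

The ratio climbs with every additional full backup of mostly unchanged data, which is why repeated fulls to deduplicated disk can approach tape economics.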
Add to that the risks inherent in tape media handling, such as damage or, even worse, media loss. The latter has been plaguing many companies in the financial sector, making the replication component of many deduplication technologies a clear advantage.
That said, tape is still not dead. Data deduplication is not as effective with image files such as scans, pictures, MRI, etc. Furthermore, some organizations have requirements for data encryption at the source, which seriously reduces or even nullifies the benefits of backup data deduplication. Also, with the recent announcement of encryption at the tape drive level (LTO-4), tape technology may have bought itself a few more years.
The answer to that can be both "yes" and "no." If we look at a very high level, a copy of data is a copy of data, and that's where a lot of people confuse the two as being somewhat the same -- one copy is just kept longer. When we start digging into what a backup is for and what an archive is for, that's when we really start seeing the distinction between the two.
A backup is really a copy of a file to protect yourself against data loss should something happen, so it's always a much more immediate need. An archive is really a copy of certain records that could be completely taken out of context of their initial environment or structure and that are kept for future reference, not for recovery purposes.
For example, if you have a customer database, there are all kinds of records in there. Some of the records may date back to 1999. Chances are you don't really need those to do your everyday business, but this is something you keep for whatever reason. There's a possibility here to archive part of this, yet you still back up your database every day in case something happens, because you need to access these records immediately.
So, there's a very specific distinction between the two and archives should not be used or considered to be used for DR purposes -- really a backup is what counts for DR. You would restore from your backups, but you don't necessarily retrieve your archives following a disaster. This archive data is far away, kept somewhere safe in case you need it; not for recovery purposes.
This is a growing concern in the industry today because we've seen data growth at 50-60% on a yearly basis. The file servers are getting huge. Mail servers are becoming just impossible to manage or to restore. And a lot of that [difficulty] is caused by a fear of making an actual decision with respect to the data because of all these compliance issues -- we're afraid to delete data now because that may get us in trouble, so we've seen these servers grow to become enormous.
Yet, sometimes we can't restore them. I've seen instances where a file server took 72 or more hours to be restored. In some cases, that's completely unacceptable from a recovery time objective (RTO) perspective. The first thing that comes to mind is: "We'll have to replicate all that data and have a standby file server ready to come up should something happen." Well if that's what you can afford, and what you need, that's fine.
But, if you can't afford to spend that kind of money, yet need access to your data, today there are a number of solutions out there that allow you to make your mail servers and your file servers a lot leaner from a data perspective. I'm talking about archiving products here. Enterprise Vault from Symantec is a fine example. You can actually take some attachments or some email messages or some files on a file server and move them to another type of storage -- lower cost storage or potentially storage that you will not restore immediately should something happen.
What we're looking at here is an opportunity to reduce the size of your mail server (e.g., Exchange) to a size that is easy to back up, easy to restore and that contains what you need to access on a daily basis.
The beauty of a lot of these solutions is they're a little like HSM -- they move the bulk of the data to another location, but leave a little pointer in your email program or on your file server. So, from a user perspective, everything is still there [on the server]. They look at it; they see it; if they click on it, it may take a little longer to come up, but the file is still there. You're not hiding that data really, you're just putting it somewhere else, and you're leaving a bit of a shortcut. That allows your servers to be much leaner, and it really enhances the recoverability. You can back them up in a nightly cycle, and you can restore these servers within a reasonable timeframe.
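The stub-and-recall mechanic can be sketched in a few lines. This is a toy model only -- real HSM and archiving products hook into the filesystem or mail store so that opening the stub transparently recalls the data; the file names here are invented:

```python
import shutil
import tempfile
from pathlib import Path

def archive_with_stub(src: Path, vault: Path) -> Path:
    """Move a file to cheaper storage and leave a small stub behind."""
    vault.mkdir(parents=True, exist_ok=True)
    dest = vault / src.name
    shutil.move(str(src), str(dest))
    stub = src.with_suffix(src.suffix + ".stub")
    stub.write_text(str(dest))  # the "pointer" users still see in place
    return stub

def recall(stub: Path) -> bytes:
    """Follow the stub back to the archived copy, as an HSM recall would."""
    return Path(stub.read_text()).read_bytes()

# Demo: archive a file to a second tier, then recall it through the stub.
with tempfile.TemporaryDirectory() as root:
    src = Path(root, "old-attachments.dat")
    src.write_bytes(b"rarely used data")
    stub = archive_with_stub(src, Path(root, "tier2"))
    print(recall(stub))
```

The stub is tiny, so the primary server shrinks to only active data, which is what makes the nightly backup and the restore window manageable again.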
At the IT level, recovery priority comes from a data perspective. When you start moving up a little bit into the business side, people start looking at the applications, because they don't really know where the data is. With today's virtualization, you lose track of where that data sits, so you look at the application instead, but it's really essentially the same idea. The recovery priority for your applications is based on the same criteria as the actual value of the data they use. It's really: Which business processes make use of those applications, and what is their criticality?
The priority itself really goes to dependencies as well. For example, DNS (which could be considered an application or a utility depending on how you look at it) could have a higher priority than some other systems because if you don't have DNS, you don't have a response on your Internet connections or your network connections. Obviously, dependencies are also very important and it's not necessarily a dollar value at this point -- priority becomes a technical question. Active Directory, authentication and security applications move up in priority and need to be restored before anything else can be restored.
Something that's very often overlooked is the backup application itself. It's probably one of the most critical applications that you have, although it doesn't show up as an application most of the time. It becomes an infrastructure component. You need to restore your backup server and your backup application before you can start restoring data.
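One way to make those dependencies explicit is to record them and derive a restore order from them. A minimal sketch using Python's standard-library topological sorter (the systems and dependencies below are hypothetical examples, not a prescribed order):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical map: each system lists what must be restored before it.
deps = {
    "dns":         [],
    "backup_app":  ["dns"],
    "active_dir":  ["dns"],
    "file_server": ["backup_app", "active_dir"],
    "banking_app": ["backup_app", "active_dir"],
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # infrastructure (DNS, backup server, AD) before business apps
```

Writing the dependencies down this way also surfaces the easily forgotten items, like the backup server itself, that nothing can be restored without.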
When we look at a business, there are always core functions and support functions. Backups are definitely a support function. In the banking industry, from a data replication and criticality of the data standpoint, it's definitely the financial data that would be important and the most critical. However, when we're looking at an application priority, we need to start looking at support functions and core functions.
Typically, from a business continuity perspective, the best way to establish this is through what we call a "business impact analysis," which really measures the impact of an outage on your revenue stream or your organization from a public perception point of view. You need to start from the business processes and look at what applications these business processes depend on -- and then you move down the chain. Once you've identified the application and the dependency, you can start looking at the data that these applications use, and categorize the data based on the business processes that use that data and the importance of those business processes to the organization from a revenue perspective.
We always have to make the distinction between qualitative and quantitative impact. Quantitative impact is fairly easy to measure; it's dollar value. The qualitative impact is really from a public perception -- loss of confidence in your company or your brand. For example, if you can't do banking for a few hours online, the bank could lose a few customers. So, the impact of an outage is not necessarily measured in dollars, but eventually will translate into dollars. That's the qualitative impact. Really that's what drives the value of data, and in the end, helps you identify and categorize the data to decide what you should replicate. This is also where you can start moving into storage tiers.
If you're along the Florida coast, chances are you should be quite far from your production site because of hurricanes. As we saw last year, it was pretty devastating in New Orleans, Alabama and Florida, so 10 or 20 miles will definitely not be sufficient in those areas. Conversely, if you're in Texas and you're trying to protect yourself from tornadoes, the distance can be less. What we've seen in recent surveys done from a DR perspective in the industry is an average of about 30 miles.
There was an effort from the SEC to try and impose a standard distance. They quickly abandoned that precisely because the right distance depends on geography and on what you're really trying to protect your data (or your data center) from.
The distance also depends on criticality and potential losses. If you are at risk of losing millions of dollars because your data is not accessible, then it may be acceptable to be far away and spend that kind of money. It's always a question of balancing your losses with the cost of protecting your assets, so it may not be financially feasible to be very far away (e.g., halfway across the country). We need to measure the losses versus the risk and the value of the data and the cost of the solution.
Another factor to consider is the latency, or the rate at which transactions or I/O have to be replicated from one place to another. So, if you're replicating data from a production site to a remote site, you need to make sure that data makes it there on time -- especially if the recovery site is very far away. There may be latency that is not acceptable to your business.
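A rough feasibility check captures the trade-off: with asynchronous replication, the data at risk is bounded by the replication backlog, so the link must be able to drain changes within the recovery point objective. This is a simplified model for illustration; the function name, units and numbers are assumptions:

```python
def meets_rpo(change_rate_mbit_s, link_mbit_s, rpo_s, burst_backlog_mbit=0.0):
    """Rough check: can the link drain the backlog within the RPO?

    Simplified model: steady change rate plus an optional burst backlog
    (e.g., from a batch job) competing for remaining link capacity.
    """
    if link_mbit_s <= change_rate_mbit_s:
        return False  # backlog only grows; the RPO will eventually be missed
    drain_s = burst_backlog_mbit / (link_mbit_s - change_rate_mbit_s)
    return drain_s <= rpo_s

# 100 Mbit/s link, 40 Mbit/s steady change rate, 5-minute RPO,
# after a 9,000 Mbit burst from a nightly batch job:
print(meets_rpo(40, 100, 300, 9_000))
```

Distance adds round-trip latency on top of this, which matters most for synchronous replication, where every write waits for the remote acknowledgment.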
Once again, there are a lot of factors to consider before a decision like that is made. One size does not fit all. It requires a lot of thinking and planning beyond just an IT perspective. We'd really need to look at the business side of things including business requirements and potentially regulatory compliance requirements as well.
Here is another prime example of technology evolving to meet business requirements. I have always been a firm believer that, except for small IT environments, operating system files should not be part of regular backups. This was especially true when using a full and incremental backup product. There are too many identical static files (DLLs and patches) across systems that end up using too much backup storage capacity.
Once again, this is where technology changes come to the rescue. Data deduplication has dramatically reduced the impact of backing up OS files on the backup storage infrastructure. Since identical data segments are not stored (only pointers) when using deduplication, it becomes much less of an issue at the storage level. However, backup software still has to inspect every OS file for changes and send it across the network even if only a single byte has changed since the last backup. There is still the issue of the registry that makes a traditional backup of the OS files questionable.
Using the replication component of server virtualization technology or ghosting, OS and registry replication, or clustering and system recovery products is still a better alternative than restoring OS files or reinstalling the OS. Furthermore, these system images can now occupy far less storage space when combined with deduplication technology.
That idea ties back into the topics of data growth, data control, data management and recoverability. Once you start categorizing your data based on criticality and recovery priority, it gives you an indication of your data segments. We have our high-priority data, we have our medium criticality data and we have our low restore priority data. This is a perfect opportunity, if you're looking into doing tiered storage, to start dividing your data or using those categories to store your data.
From a cost perspective, if your data is highly critical to your organization and you want to replicate it because it's so critical you can't afford downtime -- that data probably belongs on your highest-performance, most redundant storage array. Conversely, if you have data that can wait because there's really no rush to restore it, then maybe that belongs on your lowest tier or your lowest cost storage -- or even potentially archived on tape. Tiered storage can complement DR efficiency, ensuring that the most critical data is recovered first.
Information lifecycle management (ILM) ties right back into all of this. ILM, or data lifecycle management, is all about categorizing the data based on its criticality or potentially some legal requirements, but again ties very closely into tiered storage or data tiers. Do you keep your less valuable data on your top tier? Probably not. If the data is reaching the end of its lifecycle it should probably be moved to a storage media of lesser cost and lesser performance because at this point we're just keeping it for potentially legal reasons. Tying back into DR, we're probably not going to restore that less valuable data immediately following a disaster. All of this information can help you classify data based on criticality, value to the organization or legal implications.
There seems to be a belief that if you're a smaller shop or smaller business, you don't have a lot of data, and you don't need to send your data offsite very often. A lot of times, the decision to send tapes offsite is dictated, unfortunately, by the growing capacity of tape media. If you're backing up 100 gigabytes a week, it becomes embarrassing to a lot of folks to send that tape cartridge offsite because it's not full. So, a lot of companies have a tendency to want to fill that tape before they send it offsite -- which is a big, big mistake.
Regardless of the size of the organization, there is an equal requirement to protect your data. Being a smaller company doesn't mean you can afford to go out of business. But unfortunately, there's this perception that tapes have to be filled up. From our perspective, we always say the best practice is daily. Again, this is closely tied to your recovery point objective (RPO).
If your RPO is 24 hours, meaning that I can afford to lose up to 24 hours worth of data (but no more), people say, "I'll back up every day and I'll be OK." That's true only if you send your tapes offsite every day. If you back up every night and you send your tapes offsite every week, you can lose up to a week's worth of data. People don't necessarily think about that. They think they're backing up every night; it's good, they're protected. But, should something happen to your entire data center, all of a sudden your nightly backup is irrelevant and you're stuck recovering from the previous week's backups. Can you recreate this week of lost data? Maybe, maybe not.
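The worst case is easy to quantify with a deliberately pessimistic model: if the site (including onsite tapes) is destroyed just before the next shipment, the surviving offsite copy can be a full vaulting cycle old, and that copy itself only captured data up to the backup that preceded it. A sketch, with the function name and figures as illustrative assumptions:

```python
def worst_case_loss_hours(backup_interval_h, offsite_interval_h):
    """Upper bound on hours of data at risk if the whole site is lost.

    Pessimistic model: the newest offsite copy is one vaulting cycle old,
    plus up to one backup interval of changes made since that backup ran.
    """
    return offsite_interval_h + backup_interval_h

print(worst_case_loss_hours(24, 24))   # nightly backup, daily vaulting
print(worst_case_loss_hours(24, 168))  # nightly backup, weekly vaulting
```

Comparing that bound against the stated RPO makes the gap obvious: nightly backups with weekly vaulting cannot support a 24-hour RPO.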
It ties back into criticality of the data and potential losses. So, if you stand to lose a lot of money because you lost a week's worth of data, maybe it is worth buying more tapes and sending these tapes offsite daily as opposed to weekly. It's a very important concept that a lot of companies seem to sometimes mix up.
Also, you should never send your only copy of the data offsite. That's another mistake, because then you are left with a single copy. Best practices dictate that you keep a copy onsite for quick access and quicker recovery, and send a copy offsite for DR purposes. You should always have two copies of your tape: one onsite and one sent offsite. With the low cost of tapes today, it's not worth neglecting this practice.
Another important point when it comes to best practices with respect to tape and vaulting is the storage conditions. The trunk of your car is not a good vault -- especially if you live in Phoenix. When it heats up to 115 degrees Fahrenheit, your tapes may not be recoverable or readable. A lot of companies will store the tapes at another location they own in an effort to save money. Sometimes it's also a security concern, because they don't want the tapes to be in someone else's hands, or they're afraid the tapes may be lost. They try to develop the vaulting process in house. There's nothing wrong with that, as long as the storage conditions at the other location are optimal in terms of temperature and relative humidity -- based on the tape manufacturer's recommendations.
At the risk of sounding like a broken record, this is where technology like backup data deduplication and, more specifically, the replication part of it comes to the rescue one more time. If it is not practical or cost effective to send partially filled tapes offsite (especially those 400GB native capacity ones), if the tape storage conditions at your other location are not optimal, or if you don't want your tapes in someone else's hands, then maybe tape is not the best choice. Backup to disk with remote replication is likely the answer and, when combined with deduplication, is a cost effective alternative to tape.
The terms have been used interchangeably quite a bit lately because of the growing popularity of disk backup. The thing is, disk backup is technically just another form of media that your backup application or solution can write to. Data replication, as the name implies, should technically be an exact replica of your data.
The main difference here is that replicated data can be presented as a volume and immediately (or almost immediately) accessed by the application, because it is written in a format that is understood by your database or your email server -- whatever your application is. However, if you back up to disk, the data will be written to a disk device in a format that is understood by the backup application. Therefore, you need to restore the data that was backed up to disk to be able to access it. That is the main distinction, and the two should not be confused, because one is almost immediately accessible (e.g., a mirror or a snapshot), whereas a backup implies a restore operation before the data can be accessed.
That's an important distinction because now, if we're talking about disk backup, we're talking about faster backup than tape, but not faster restores than data replication. Data replication allows you to do almost instant recovery of your data -- a very specific distinction between the two.
The recovery point objective (RPO) and the recovery time objective (RTO) are two very specific parameters that are closely associated with recovery. The RTO is how long you can basically go without a specific application. This is often associated with your maximum allowable or maximum tolerable outage.
The RTO is really used to dictate your use of replication or backup to tape or disk. That also dictates what you will put together for an infrastructure whether it's a high-availability cluster for seamless failover or something more modest. If your RTO is zero (I cannot go down) then you may opt to have a completely redundant infrastructure with replicated data offsite and so on. If your RTO is 48 hours or 72 hours then maybe tape backup is OK for that specific application. That's the RTO.
The RPO is slightly different. This dictates the allowable data loss -- how much data can I afford to lose? In other words, if I do a nightly backup at 7:00 p.m. and my system goes up in flames at 4:00 p.m. the following day, everything that was changed since my last backup is lost. My RPO in this particular context is the previous day's backup. If I'm a company that does online transaction processing -- American Express, for example -- well, maybe my RPO is down to the latest transaction, the latest bits of information that came in. Again, that dictates the kind of data protection solution you want in place.
So both of them, RTO and RPO, really influence the kind of redundancy or backup infrastructure you will put together. The tighter the RTO, and the tighter the RPO, the more money you will spend on your infrastructure.
You could answer that with one word really, and I would have to say "testing." Just "testing." Whatever you do when you're protecting data, whether it's a backup, whether it's replication, whatever it is, make sure that you test what you put in place. Just because the vendor's glossy ad said that the product allows you to restore "virtually in seconds," I wouldn't necessarily take their word for it.
It's a question of being prepared. You have all this beautiful technology, you have all these storage tiers, you have these recovery schemes, you've done everything the right way -- but unless you test all of this on a regular basis, you will always run the risk that it will not work when you need it the most.
Once, I saw a database that was replicated after the production database became corrupt. And, because of a planning glitch, poor documentation or human error, the corrupted copy of the database was restored over the good copy. So, they ended up with two corrupted database copies. It goes to show that documentation, planning and of course testing -- making sure everything works -- is really important.
Many times you will ask clients when you walk into an environment: "How are your backups?"
"Oh our backups are fine -- run every night; successful completion every night."
The next question usually is: "Did you try restoring from all this?"
"Umm. Well, we do the occasional file restore."
Pierre Dorion is the data center practice director and a senior consultant with Long View Systems Inc. in Phoenix, Ariz., specializing in the areas of business continuity and DR planning services and corporate data protection.