Disk-based backup and remote disk-based data backup have become important tools for companies that want to get their data offsite but have short recovery time requirements. Pierre Dorion, data center practice director at Long View Systems Inc., discusses using disk-based backup for disaster recovery in this FAQ. His answers are also available as an MP3 below.
Table of contents:
>> Benefits of disk-based data backup for disaster recovery
>> Differences between synchronous and asynchronous replication
>> WAN optimization and remote disk-based backup
>> Deduplication and remote disk-based backup
>> CDP and remote disk-based backup
Also, there's no media handling with offsite disk backup, so with tape out of the picture, we can use replication technologies to send data across the wire instead of relying on physical tape movements. Disks are also random-access devices, so there's no mounting of media: with tape, you have to wait for the media and the tape drive before you can access the data, whereas with disk, once the application accesses the data, it can open multiple access streams. Offsite disk backup has the big advantage of being able to leverage replication, and technologies such as snapshots and block-level backups can capture just the portion of the data that's changed. One downside to consider is that offsite disk backup costs more than tape.
Synchronous replication is the ability to copy data from one place to another by writing it to both locations simultaneously, without any delay. It relies on a return code from the target end confirming that the data was written before the write completes. Asynchronous replication doesn't depend on that return code. In other words, data isn't written simultaneously -- it's written to one device, buffered in memory and then written to the target device at a later time. That delay can vary from milliseconds to seconds to minutes depending on the case.
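The distinction above can be sketched in a few lines of Python. This is a toy model, not any vendor's implementation: the synchronous writer blocks until the target's return code comes back, while the asynchronous writer buffers the write in memory and lets a background thread apply it to the target later.

```python
import queue
import threading

class Target:
    """Stand-in for a remote replication target."""
    def __init__(self):
        self.blocks = {}

    def write(self, addr, data):
        self.blocks[addr] = data
        return "ack"  # return code confirming the data was written

class SyncReplicator:
    """Synchronous: the write doesn't complete until the target acks."""
    def __init__(self, target):
        self.target = target

    def write(self, addr, data):
        # Blocks until the target confirms -- no data loss, but every
        # write pays the round-trip latency to the remote site.
        assert self.target.write(addr, data) == "ack"

class AsyncReplicator:
    """Asynchronous: the write returns immediately; a background
    thread drains the in-memory buffer to the target later."""
    def __init__(self, target):
        self.target = target
        self.buffer = queue.Queue()  # writes buffered in memory
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, addr, data):
        self.buffer.put((addr, data))  # returns without waiting

    def _drain(self):
        while True:
            addr, data = self.buffer.get()
            self.target.write(addr, data)  # applied some time later
```

The trade-off is visible in the code: `SyncReplicator.write` cannot return without the ack, which is exactly why latency over distance pushes people toward the asynchronous model.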
These two types of replication exist because of latency issues. Latency comes into play whenever data has to travel over a network between two locations. The longer the distance between locations, the more hops you may run into, especially with replication over an IP network. A good example of latency is when you try to look at the properties of a file on a remote file server: although you aren't transferring the file, just its properties, there's a delay in getting that information. These latency issues usually force people to move to asynchronous replication, which means you aren't maintaining a simultaneous copy. A lot of companies implement local synchronous replication so they obtain a backup copy immediately, and then use that local copy to create a remote copy asynchronously.
WAN optimization eliminates redundant transmission of data. It stages data in local caches and compresses it. There's also a data deduplication element that eliminates the transfer of redundant data across a WAN: once it identifies something it already has, it sends a reference to it instead of the entire byte sequence, which lets you send a lot less data across the wire while keeping the transmission coherent. The compression aspect of WAN optimization looks for data patterns that can be represented more efficiently. So deduplication and compression are combined. Essentially, WAN optimization lets you send a lot more information across the WAN without using a lot of bandwidth by optimizing what you send.
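A hypothetical sketch of those two ideas combined: for each chunk, the sender transmits only a short reference (here, a SHA-256 digest) if the far side has seen the chunk before, and a compressed payload otherwise. The chunking scheme and digest choice are illustrative assumptions, not how any particular WAN optimizer works.

```python
import hashlib
import zlib

def send_over_wan(chunks, seen):
    """Return the number of bytes actually put on the wire.

    `seen` is the set of chunk digests the far side already holds
    (both ends of a WAN optimizer keep this in sync via their caches).
    """
    wire_bytes = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        if digest in seen:
            wire_bytes += len(digest)  # 32-byte reference only
        else:
            seen.add(digest)
            # New data still benefits from compression of patterns.
            wire_bytes += len(zlib.compress(chunk))
    return wire_bytes

seen = set()
first = send_over_wan([b"A" * 4096], seen)   # payload goes over the wire
repeat = send_over_wan([b"A" * 4096], seen)  # only the reference goes
```

The second transfer costs 32 bytes regardless of the chunk's size, which is where the big WAN savings come from.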
Data deduplication is very similar to WAN optimization. At the disk level, it recognizes identical data segments and creates references and pointers to them. If it sees a segment of blocks or a sequence of bytes it has already stored, it writes a much smaller pointer to disk instead of the entire data segment. That makes deduplication a good way to replace tape media without spending a fortune on disk: if you're considering disk as your backup target instead of tape, deduplication reduces the amount of data you store, so you don't have to buy as much disk space.
The one thing to watch out for with data deduplication is that it's not well suited for short-term retention backups, because its efficiency relies on history: the more often it recognizes an identical byte sequence, the more data reduction you get. For example, if you back up the same Word document 20 times, a large portion of that document never changes, so instead of storing 20 copies of the file, deduplication ensures only one copy of each unique byte sequence is kept. Deduplication can reduce data storage 15:1 or 20:1, and in some cases you could even double that.
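The "20 backups, one stored copy" behavior described above can be sketched as a tiny segment store. This is an illustrative model, not a real product: fixed-size segments are indexed by hash, each backup records only pointers, and identical segments are written once no matter how many backups reference them.

```python
import hashlib

class DedupStore:
    """Toy block-level deduplicating backup target."""
    def __init__(self, segment_size=4096):
        self.segment_size = segment_size
        self.segments = {}  # digest -> unique byte sequence (stored once)
        self.backups = []   # each backup is just a list of pointers

    def backup(self, data):
        pointers = []
        for i in range(0, len(data), self.segment_size):
            seg = data[i:i + self.segment_size]
            digest = hashlib.sha256(seg).digest()
            self.segments.setdefault(digest, seg)  # write only if unseen
            pointers.append(digest)
        self.backups.append(pointers)

    def restore(self, n):
        """Rebuild backup number n from pointers and stored segments."""
        return b"".join(self.segments[d] for d in self.backups[n])

store = DedupStore()
document = b"quarterly report " * 1000  # ~17 KB "Word document"
for _ in range(20):
    store.backup(document)  # same file backed up 20 times
```

After 20 identical backups, `store.segments` holds one copy of each unique segment, while each of the 20 backups is a list of cheap pointers -- which is also why history matters: the savings only appear on the repeats.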
A lot of deduplication technologies are also replication capable. By replicating deduplicated data, you reduce network bandwidth requirements because you're copying a reduced data set across the network. In the end, when data deduplication is combined with WAN optimization, you can significantly reduce the data stream between two locations.
Some people confuse CDP with mirroring, or with traditional data replication. Traditional backup, whether to tape or disk, gives you a point-in-time copy. At the other extreme, mirroring is a constant copy of all the changes made to the data. The problem with mirroring is that if the original copy is corrupted, the replicated copy will be corrupted too. CDP, on the other hand, captures every change as it happens and lets the user create multiple recovery points, as opposed to a traditional backup, which gives you only one recovery point. In the example of corrupted data, CDP gives the user the ability to roll back changes to the point where the data was still valid.
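The rollback idea can be sketched as a timestamped change journal. This is a simplified assumption about how CDP works (real products journal at the block or I/O level): every write records the old value, so the data can be rewound to any earlier moment, which a mirror, blindly propagating the corruption, cannot do.

```python
import time

class CDPJournal:
    """Toy continuous-data-protection journal with rollback."""
    def __init__(self, initial):
        self.state = dict(initial)
        self.journal = []  # (timestamp, key, previous value)

    def write(self, key, value):
        # Record what the data looked like before this change.
        self.journal.append((time.time(), key, self.state.get(key)))
        self.state[key] = value

    def rollback_to(self, t):
        """Undo every change made after time t, newest first."""
        while self.journal and self.journal[-1][0] > t:
            _, key, old = self.journal.pop()
            if old is None:
                self.state.pop(key, None)
            else:
                self.state[key] = old

data = CDPJournal({"file": "good contents"})
valid_point = time.time()
time.sleep(0.01)  # ensure later writes get a later timestamp
data.write("file", "CORRUPTED")   # corruption propagates to a mirror,
data.rollback_to(valid_point)     # but CDP can rewind past it
```

Unlike a single point-in-time backup, any timestamp in the journal's history is a usable recovery point.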
Who should use these technologies? They're meant for people who need continuous protection but also need to recover data from a certain point in time. It really depends on how your recovery time objectives (RTOs) are defined. If you have very stringent requirements and need the ability to roll back in a granular fashion, CDP is better than mirroring, which has its limitations in the sense that it copies blindly without knowing whether the data at the source is usable.