What you will learn in this tip: Data deduplication is no longer just a cool technology; it has become a fairly common component of modern data backup strategies. Learn how to use data dedupe technology as part of your disaster recovery strategy.
With products available from major storage vendors, data dedupe is clearly no longer a niche technology. You may already have a dedupe appliance or dedupe software installed in your environment as part of your data backup strategy. But deduplication can also play an important role in your disaster recovery strategy.
How data dedupe technology works
Because deduplication is based on the replacement of identical data segments with much smaller "stubs" that point back to a single, unique data segment common to all, it is easy to understand that deduplication ratios increase with history. What I mean by "history" is that the higher the number of similar files subjected to deduplication, the better the reduction ratio becomes because of the increased incidence of identical data segments. This is why deduplication is so popular for use on storage targets for backup and archives -- we typically store multiple copies of files for a specific amount of time, providing ample opportunities for the identification of duplicate data segments.
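The effect of "history" on the reduction ratio can be shown with a small sketch. It is only an illustration under stated assumptions: the DedupeStore class, the 8-byte segment size and the nightly_backups helper are hypothetical names invented for this example, and the segments are fixed-size for simplicity rather than the variable-size segments real products typically use.

```python
import hashlib

SEGMENT_SIZE = 8  # bytes; unrealistically small so the numbers are easy to follow


class DedupeStore:
    """Toy segment store: each unique segment is kept once, keyed by its hash."""

    def __init__(self):
        self.segments = {}      # digest -> segment bytes (stored only once)
        self.logical_bytes = 0  # what the backups would occupy without dedupe

    def write(self, data: bytes) -> list:
        """Store one backup and return its pointers (the list of digests)."""
        pointers = []
        for i in range(0, len(data), SEGMENT_SIZE):
            segment = data[i:i + SEGMENT_SIZE]
            digest = hashlib.sha256(segment).hexdigest()
            self.segments.setdefault(digest, segment)  # keep first copy only
            pointers.append(digest)
        self.logical_bytes += len(data)
        return pointers

    def ratio(self) -> float:
        """Logical bytes written divided by bytes actually stored."""
        stored = sum(len(s) for s in self.segments.values())
        return self.logical_bytes / stored


def nightly_backups(nights: int):
    """Yield one full backup per night; a single segment changes each night."""
    segments = [f"seg{n:05d}".encode() for n in range(64)]  # 8 bytes each
    for night in range(nights):
        segments[night % 64] = f"new{night:05d}".encode()
        yield b"".join(segments)


store = DedupeStore()
ratios = []
for backup in nightly_backups(10):
    store.write(backup)
    ratios.append(round(store.ratio(), 2))

print(ratios)  # the ratio climbs as more similar backups accumulate
```

In this toy run the ratio starts at 1.0 for the first full backup and rises with every additional night, because each new backup adds 512 logical bytes but only one new 8-byte segment.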
It's possible to implement deduplication on a primary storage device, but the focus of this tip is on deduplication and disaster recovery. Single-instance storage (SIS) is also considered by many to be a form of data deduplication: SIS discovers identical files or objects and stores them only once, with references pointing back to the single stored copy. For the purpose of this discussion, I will focus on variable-segment deduplication. Deduplication-capable storage arrays are often implemented transparently, either as a disk target or as a tape replacement in the form of a virtual tape library (VTL), when used in conjunction with backup software. There are some exceptions where the backup software itself has a deduplication capability, such as CommVault Simpana or IBM Tivoli Storage Manager.
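To make the difference between single-instance storage and variable-segment deduplication concrete, here is a rough sketch. It assumes one common way of producing variable segments (cutting wherever a checksum of the last few bytes hits a chosen value); the WINDOW and DIVISOR values and all function names are assumptions made for this example, not any vendor's actual algorithm.

```python
import hashlib
import random
import zlib

WINDOW = 16   # bytes examined when deciding where a segment ends (assumed value)
DIVISOR = 64  # a cut occurs roughly once every 64 bytes on average (assumed value)


def variable_segments(data: bytes) -> list:
    """Split data at content-defined cut points.

    A cut depends only on the WINDOW bytes before it, so inserting a byte
    early in a file shifts later segments without changing their content.
    """
    segments, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            segments.append(data[start:end])
            start = end
    if start < len(data):
        segments.append(data[start:])  # trailing bytes after the last cut
    return segments


def stored_bytes_sis(files) -> int:
    """Single-instance storage: one copy per distinct whole file."""
    unique = {hashlib.sha256(f).hexdigest(): len(f) for f in files}
    return sum(unique.values())


def stored_bytes_segments(files) -> int:
    """Segment-level deduplication: one copy per distinct segment."""
    unique = {}
    for f in files:
        for seg in variable_segments(f):
            unique[hashlib.sha256(seg).hexdigest()] = len(seg)
    return sum(unique.values())


rng = random.Random(1)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"!" + original        # the same file with one byte inserted up front
exact_copy = bytes(original)    # an identical copy

files = [original, edited, exact_copy]
print("SIS stores     :", stored_bytes_sis(files), "bytes")
print("Segments store :", stored_bytes_segments(files), "bytes")
```

SIS collapses the exact copy but must keep the edited file in full, because a one-byte change gives the whole file a different hash. The segment-level approach only has to keep the segments that actually differ.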
Deduplication can be grouped under three main categories in terms of backup and disaster recovery:
1. Deduplication at the source, where duplicate data segments are identified on the host before data is backed up. EMC Avamar and CommVault Simpana are two examples of products offering this capability.
2. Deduplication on the primary backup storage target (typically disk), where duplicate data segments are identified. This can be done using any of many products, including CommVault Simpana, which can do both source and target deduplication.
3. Deduplication on secondary or copy backup storage. This is common in environments where a deduplication-capable array is used as a replacement for or complement to tape storage.
Different vendors might introduce variations in their approach, such as in-band and out-of-band deduplication, or give their technology a slightly different spin to differentiate themselves, but the above can essentially be considered the three main flavors. Some backup software also incorporates deduplication capabilities, but in the context of this discussion it can be classified as source dedupe, target dedupe or, in some cases, both, as with CommVault.
The differences between source-based and target deduplication are often the subject of debate. Source-based deduplication reduces the amount of data sent across the network, which is especially important for remote sites sending data back to a main site. However, target dedupe provides the opportunity to identify more identical data segments by comparing all data in a single location rather than at the individual host level. CommVault and EMC Data Domain, for example, are two products that can provide global deduplication, which means that data deduplicated on an individual host at the source, or on a smaller deduplication array, can be replicated back to a single storage device and deduplicated globally.
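The trade-off described above can be sketched in a few lines. This is a simplified model, not a description of how any named product works: the two hosts, their segment contents and the helper names (digest, unique_segments, segments_to_send) are all hypothetical, and each backup is assumed to be split into segments already.

```python
import hashlib


def digest(segment: bytes) -> str:
    return hashlib.sha256(segment).hexdigest()


# Two hypothetical hosts that share the same operating system blocks.
os_image = [f"os-block-{n}".encode() for n in range(50)]
host_a = os_image + [f"a-data-{n}".encode() for n in range(10)]
host_b = os_image + [f"b-data-{n}".encode() for n in range(10)]


def unique_segments(*backups) -> int:
    """Count distinct segments across the given backups (one shared index)."""
    return len({digest(seg) for backup in backups for seg in backup})


# Each host deduplicating in isolation vs. one global index on the target.
per_host_total = unique_segments(host_a) + unique_segments(host_b)
global_total = unique_segments(host_a, host_b)


def segments_to_send(backup, already_sent: set) -> list:
    """Source-side dedupe: only segments not yet sent cross the network."""
    new_segments = []
    for seg in backup:
        d = digest(seg)
        if d not in already_sent:
            already_sent.add(d)
            new_segments.append(seg)
    return new_segments


sent_index = set()
night1 = segments_to_send(host_a, sent_index)            # first full backup
host_a_night2 = host_a[:-1] + [b"a-data-changed"]        # one segment changed
night2 = segments_to_send(host_a_night2, sent_index)     # only the change

print(per_host_total, global_total)   # 120 70
print(len(night1), len(night2))       # 60 1
```

The first pair of numbers shows why comparing all data in one place finds more duplicates (the shared operating system blocks are stored once instead of twice). The second pair shows why source-side deduplication matters for remote sites: after the first backup, only the changed segment is sent.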
Disaster recovery implications
So far we have focused on backup, which is obviously an important part of a disaster recovery strategy. But without an offsite copy, backups alone cannot provide full protection against a site-wide disaster that destroys both the production data and its backup copies. This is where data dedupe technology can improve the disaster recovery process via remote replication. As mentioned earlier, deduplication significantly reduces the amount of backup data stored by replacing identical data segments with stubs or pointers; this reduced data set can be replicated to a remote deduplication-capable storage array using much less network bandwidth than the full backup data would require.
In addition, in the event of the total failure of the deduplication storage device at the primary site, deduplicated data at the remote site can be cross-replicated back to the primary site to rebuild the failed storage device. Because the deduplicated data and the pointers are kept together, this self-contained data set is unaffected and ready to use for restore operations on individual systems that used it as a backup target.
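The replication and rebuild scenario above can be modeled with a toy example. The DedupeArray class, its method names and the 4-byte segment size are assumptions made for illustration only; real arrays use their own replication protocols.

```python
import hashlib


class DedupeArray:
    """Toy deduplication array: unique segments plus per-backup pointer lists."""

    def __init__(self):
        self.segments = {}  # digest -> segment bytes
        self.backups = {}   # backup name -> ordered list of digests (pointers)

    def ingest(self, name: str, data: bytes, size: int = 4) -> None:
        pointers = []
        for i in range(0, len(data), size):
            seg = data[i:i + size]
            d = hashlib.sha256(seg).hexdigest()
            self.segments.setdefault(d, seg)
            pointers.append(d)
        self.backups[name] = pointers

    def replicate_to(self, other: "DedupeArray") -> int:
        """Copy missing segments and all pointer lists; return segment bytes sent."""
        sent = 0
        for d, seg in self.segments.items():
            if d not in other.segments:
                other.segments[d] = seg
                sent += len(seg)
        other.backups.update({n: list(p) for n, p in self.backups.items()})
        return sent

    def restore(self, name: str) -> bytes:
        """Reassemble a backup by following its pointers."""
        return b"".join(self.segments[d] for d in self.backups[name])


primary, remote = DedupeArray(), DedupeArray()
primary.ingest("mon", b"AAAABBBBCCCCDDDD")
primary.ingest("tue", b"AAAABBBBCCCCEEEE")
first_sync = primary.replicate_to(remote)   # 20 bytes for 32 bytes of backups

primary.ingest("wed", b"AAAABBBBFFFFEEEE")
second_sync = primary.replicate_to(remote)  # only the new FFFF segment: 4 bytes

# Simulate total loss of the primary array, then rebuild it from the remote.
rebuilt = DedupeArray()
remote.replicate_to(rebuilt)
print(first_sync, second_sync, rebuilt.restore("wed"))
```

Because the segments and the pointer lists travel together, the rebuilt array can serve restores for every backup that was replicated before the failure.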
Data dedupe advantages in backup and disaster recovery
The main advantages of deduplication in the context of backup and disaster recovery include:
- Reduced amount of disk required to store backup data
- Potential to increase retention for some backups
- Reduced network bandwidth requirement for remote replication (deduped data is replicated)
- Reduced network bandwidth requirement for backups when used at the source
While data dedupe offers many enhancements to a disaster recovery strategy, there are certain things to watch out for. Deduplicated data must still be restored in the event of a disaster. While deduplication leverages disk storage, it isn't the same as data mirroring or cloning, and even if deduplicated data can be replicated remotely, it doesn't become a mirrored volume that can be mounted by a server to resume business processing. Data must be reconstructed and restored to a file system recognized by a server before it can be used.
In addition, if data was copied to tape from a deduplication-capable disk storage array using a standard backup software product, data was likely reconstructed in the process and lost all its deduplication attributes. This means files are back to their original size and number, and must be restored from tape media in the event of a disaster.
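A short sketch shows why reconstructed data returns to its original size. The dedupe and rehydrate helpers and the seven sample backups are hypothetical, and the 4-byte segments are chosen only to keep the arithmetic visible.

```python
import hashlib

SEGMENT = 4  # bytes per segment (assumed, for readability)


def dedupe(backups: dict):
    """Return (segment store, pointer lists) for a dict of name -> bytes."""
    store, pointers = {}, {}
    for name, data in backups.items():
        refs = []
        for i in range(0, len(data), SEGMENT):
            seg = data[i:i + SEGMENT]
            d = hashlib.sha256(seg).hexdigest()
            store[d] = seg
            refs.append(d)
        pointers[name] = refs
    return store, pointers


def rehydrate(store: dict, refs: list) -> bytes:
    """Reconstruct the original data by following the pointers."""
    return b"".join(store[d] for d in refs)


# Seven nightly backups that differ only in their last 4 bytes.
backups = {f"night{n}": b"AAAABBBBCCCC" + f"{n:04d}".encode() for n in range(7)}
store, pointers = dedupe(backups)

on_disk = sum(len(seg) for seg in store.values())
on_tape = sum(len(rehydrate(store, refs)) for refs in pointers.values())

print(on_disk, on_tape)  # 40 112
```

In this example the deduplicated disk copy holds 40 bytes, while copying every backup out in reconstructed form writes the full 112 bytes, which is the same amount that would then have to be read back from tape during a restore.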
As you can see, data dedupe technology isn't just for data backup and primary storage; the benefits of using it as part of your disaster recovery strategy make it worth careful consideration.
About this author: Pierre Dorion is the data center practice director and a senior consultant with Long View Systems Inc. in Phoenix, Ariz., specializing in the areas of business continuity and DR planning services and corporate data protection.