Managing and protecting all enterprise data


Data determines the right disaster recovery

The key to cost-effective disaster recovery is first placing a value on data, and then selecting the appropriate data protection technologies. A mix-and-match approach can help you get the right disaster recovery without busting your budget.

What is a continuous snapshot?
A block-mode continuous data protection (CDP) approach that timestamps every snap.
  • It's typically deployed in a purpose-built appliance that uses zero array cycles per snap, so it has no impact on the storage array's performance.
  • Each snap captures only the changes and adds them to the previous snaps, eliminating the need for complete data image captures and the associated storage space.
  • The application writes to the continuous snapshot appliance as if it were a mirror of the primary storage.
  • Rollbacks can be made to any point in time before the data was corrupted, deleted or damaged.
  • All disaster recovery is performed in the background and never has to disrupt applications.
  • Recovery is painless and nearly instantaneous.
  • A continuous snapshot is a viable, low-cost alternative to mirroring.
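The journal-and-rollback behavior a continuous snapshot provides can be sketched as a timestamped change log. This is a simplification for illustration, not any vendor's implementation; the class and method names are hypothetical:

```python
class ContinuousSnapshotJournal:
    """Sketch of a CDP change journal: every write is timestamped, and a
    rollback replays only the changes made at or before a chosen instant."""

    def __init__(self, base_image):
        self.base = dict(base_image)   # block -> data at the time CDP was enabled
        self.journal = []              # (timestamp, block, data), in time order

    def record_write(self, timestamp, block, data):
        # Each snap captures only the change, not a full image.
        self.journal.append((timestamp, block, data))

    def rollback(self, point_in_time):
        """Reconstruct the volume as it existed at point_in_time."""
        image = dict(self.base)
        for ts, block, data in self.journal:
            if ts > point_in_time:
                break                  # journal is time-ordered; stop here
            image[block] = data
        return image
```

For example, if a damaging write lands at time 3, rolling back to time 2 yields the volume as it stood just before the corruption.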
How much should you spend on disaster recovery (DR)? It's a trick question that few, if any, storage administrators know how to answer. You can easily spend a king's ransom to protect your data, but few companies have that kind of money. The key to cost-effective DR is first placing a value on the data--and understanding how the data's value changes over time--and then matching various data protection technologies to that value.

In an earlier article (see The search for cost-effective disaster recovery), I described how to develop an application/data classification foundation (ADCF) that lays the groundwork for cost-effective DR. This foundation has six steps:

  1. Classify each application and its data into four categories:
    • Mission critical
    • Essential
    • Important
    • Less critical
  2. Determine the required recovery point objective (RPO) and recovery time objective (RTO) for each class of data.
  3. Determine the available DR options per class of data.
  4. Establish each option's TCO for the expected life of the implementation.
  5. Evaluate the skills required at all DR locations.
  6. Match the data, DR options and skills to the budget to determine the breadth of the DR gap--the difference between the level of DR the data requires and the level of DR the organization can afford to provide.
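Steps 1 through 4 and step 6 reduce to a matching exercise: find the cheapest DR option that meets each class's RPO/RTO targets, and flag a DR gap when none fits the budget. The classes, targets (in seconds) and TCO figures below are entirely hypothetical:

```python
# Hypothetical ADCF worksheet: target RPO/RTO per data class (seconds).
CLASSES = {
    "mission critical": {"rpo": 0,      "rto": 300},
    "essential":        {"rpo": 900,    "rto": 3600},
    "important":        {"rpo": 86400,  "rto": 86400},
    "less critical":    {"rpo": 604800, "rto": 259200},
}

# Candidate DR options with the RPO/RTO they deliver and an assumed TCO index.
OPTIONS = [
    {"name": "sync remote mirroring",  "rpo": 0,     "rto": 60,     "tco": 500},
    {"name": "async remote mirroring", "rpo": 300,   "rto": 600,    "tco": 300},
    {"name": "replication (CDP)",      "rpo": 60,    "rto": 1800,   "tco": 150},
    {"name": "nightly tape backup",    "rpo": 86400, "rto": 172800, "tco": 40},
]

def cheapest_option(data_class, budget=None):
    """Lowest-TCO option meeting the class's RPO/RTO, or None (a DR gap)."""
    target = CLASSES[data_class]
    viable = [o for o in OPTIONS
              if o["rpo"] <= target["rpo"] and o["rto"] <= target["rto"]
              and (budget is None or o["tco"] <= budget)]
    return min(viable, key=lambda o: o["tco"], default=None)
```

With these numbers, mission-critical data requires synchronous mirroring, essential data gets by with cheaper CDP replication, and a tight budget exposes a DR gap (`None`).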

Remote mirroring
Remote mirroring provides data accessibility protection for an application using physically separate locations. While similar to mirroring within a RAID array, remote mirroring takes place over MAN or WAN distances. It's usually between storage arrays or storage appliances, and can be synchronous or asynchronous.

Synchronous remote mirroring provides the best possible DR RPO and RTO: an RPO of zero lost data and an RTO of typically seconds to minutes. It does this by neither completing nor acknowledging the local write until the remote write is completed and acknowledged. Additional writes can't occur until each preceding write has been completed and acknowledged. This means local performance is directly tied to the performance of the remote DR device; distance is the limiting factor. Synchronous remote mirroring is rarely deployed for circuit distances greater than 160km (100 miles).

With asynchronous remote mirroring, local writes are completed and acknowledged before the remote writes. Asynchronous remote mirroring is a "store-and-forward" technique that reduces I/Os and wait delays, allowing remote writes to fall behind the local writes. This means the RPO for lost data can range from seconds to minutes, and even hours in some cases. Asynchronous remote mirroring is most often utilized when the remote site is a long distance from the local site.
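The difference between the two modes comes down to when the write is acknowledged. A minimal sketch, with in-memory dicts standing in for the local and remote arrays (all names are hypothetical):

```python
from collections import deque

class SyncMirror:
    """Synchronous: the local write isn't acknowledged until the remote
    copy has completed, so the caller waits out the full round trip."""
    def __init__(self, local, remote):
        self.local, self.remote = local, remote

    def write(self, block, data):
        self.local[block] = data
        self.remote[block] = data   # must complete before acknowledgment
        return "ack"

class AsyncMirror:
    """Asynchronous: store-and-forward. The write is acknowledged as soon
    as it lands locally; remote writes are allowed to fall behind."""
    def __init__(self, local, remote):
        self.local, self.remote = local, remote
        self.pending = deque()      # not yet on the remote side = RPO exposure

    def write(self, block, data):
        self.local[block] = data
        self.pending.append((block, data))
        return "ack"                # acknowledged before the remote write

    def drain(self):
        """Forward queued writes to the remote site."""
        while self.pending:
            block, data = self.pending.popleft()
            self.remote[block] = data
```

Anything still sitting in the async queue when disaster strikes is lost data, which is why the asynchronous RPO ranges from seconds to hours rather than zero.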

The primary advantage of both synchronous and asynchronous remote mirroring is the minimal (asynchronous) to zero (synchronous) risk exposure in losing data during a disaster. A secondary advantage is the potential for quick data recovery when a disaster occurs. Remote mirroring doesn't require server agents, and it provides heterogeneous server and application support.

Remote mirroring applications are often pricey, the equipment is usually expensive, and it typically requires at least twice the primary disk space--sometimes much more. However, when the lowest possible RPO and RTO are required, remote mirroring is the answer.

Another disadvantage is that remote mirroring doesn't prevent a rolling disaster, data damage, corruption or accidental deletion. If data is corrupted, damaged or deleted at the primary site, it will also be corrupted, damaged or deleted at the DR site. Some asynchronous remote mirroring products timestamp each transaction and allow recovery to a point in time before the corruption or deletion occurred, but they're exceptions to the rule. This means procedures other than remote mirroring must also be implemented to allow for recovery of corrupted, damaged or deleted data. Other disadvantages include lack of support for heterogeneous arrays, no support for internal storage, and nearly no application and file information.

Less-expensive alternatives to remote mirroring can also provide the lowest possible RPO and RTO. They're generally continuous data protection (CDP) products and include time-based continuous snapshots, automated backup, replication of changed data and automated, generational-change distributed backup. They offer a lower TCO than remote mirroring, support heterogeneous storage and provide better rollback capabilities. But they usually require installing and managing agents.

Backup
Backup applications copy primary stored data directly from the application server and move it over TCP/IP networks to a local backup server or a remote DR backup server. The server then writes the copied data to disk or tape. RPO is the window between backups or incremental backups; RTO is hours at best, but usually days to weeks.
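The RPO/RTO arithmetic for scheduled backup is straightforward: the worst-case data loss is the interval between backups, and recovery time is gated by how fast data can be moved back from the backup target. A rough sketch, with illustrative figures only:

```python
def backup_rpo_rto(interval_hours, data_gb, restore_mb_per_s):
    """Rough backup RPO/RTO estimate (illustrative, ignores verification,
    tape mount/seek time and application restart).

    RPO: at worst, everything written since the last backup is lost.
    RTO: time to stream the full data set back from the backup target.
    """
    rpo_hours = interval_hours
    rto_hours = (data_gb * 1024) / restore_mb_per_s / 3600
    return rpo_hours, rto_hours

# Example: 2 TB restored at 50 MB/s from nightly backups.
rpo, rto = backup_rpo_rto(24, 2048, 50)   # rpo = 24 h, rto is roughly half a day
```

Adding incrementals every few hours shrinks the RPO, but the RTO barely moves--the restore still has to rebuild the full data set, which is why backup alone rarely satisfies critical-data recovery targets.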

While backup is the primary DR application deployed in most IT organizations, it also has the highest failure rate. Failures can be attributed to user error, bandwidth issues, throughput issues, tape issues and even application server availability requirements.

The primary advantage of backup is its familiarity--it's a known quantity, both good and bad. Storage administrators know how to deploy and use backup, and the TCO is relatively low depending on the storage environment.

The two key disadvantages of backup are that its RPO and RTO are usually quite high, and backup is a local process. There are exceptions, however. Several backup programs distribute and centralize backup while providing continuous incremental backups, shrinking the RPO considerably. Unfortunately, recovery is still a lengthy process. Data consistency and usability--the ability to use the backed-up data without modification, reordering or re-creation--may also be a problem. Backup programs require server-based agents, and backup costs escalate sharply as the environment scales and grows more complex.

Backup products are evolving and improving. Virtual tape, disk-to-disk-to-tape (D2D2T) and massive array of idle disks (MAID) technologies speed backups and recovery times. New types of backup software, such as content-addressable storage (CAS), reduce the amount of data that must be backed up by sending only changed data and meta tags about the data. This significantly reduces recovery times and dramatically increases recovered data usability. Distributed backup eliminates the installation of server agents. These new types of backup have RPOs and RTOs suitable for critical data.

Replication
Replication software replicates data from server to server, either synchronously or asynchronously, with incremental and CDP modes. Replicated data travels over TCP/IP networks to a remote server's disk, and a backup client is then needed to move the data to a storage device. RPO for replication is similar to that of storage array remote mirroring, depending on whether it's synchronous or asynchronous. RTO can be a little faster because the DR application servers are already collocated with the DR storage.

Replication software is easy to install and operate. It can run locally and distributed, and because it's server-, storage- and infrastructure-agnostic, there are no hardware lock-ins. Replication software costs are less than those for backup software and much less than storage array-based remote mirroring. Replication has evolved to include application-aware agents, continuous protection and rollback capabilities. One important benefit to replication is data migration. Replication software simplifies the process and replicates only the data that needs to be replicated in a non-disruptive manner.

Replication software can't prevent damaged data from being replicated, and server agents must be maintained and managed. RTO can be significantly increased if there's a single DR server caching the replication from different application operating systems. In the event of a disaster, all data must be recovered and rewritten before the applications can access the data. This is similar to backup. If there's a DR replication server per operating system, the RTO rivals storage array mirroring.

Snapshots
A snapshot provides a point-in-time reference marker to data stored on a storage system, and snapshots are a way to speed RTOs. There are two primary types of snapshots: copy-on-write and split-mirror.

A copy-on-write snapshot stores changes and additions to existing data. Data recovery is rapid in case of a disk write error, corrupted file or program malfunction; however, all of the previous snapshots must be available if complete archiving or recovery is required. A split-mirror snapshot references all the data on a set of mirrored drives where one is local and the other is local or remote. Each time the snapshot is run, it snaps the entire volume, not just new or updated data.
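The copy-on-write mechanism can be sketched as follows. This is a simplification for illustration--real products work at the block layer inside the array or appliance--and the names are hypothetical:

```python
class CopyOnWriteSnapshot:
    """Sketch of copy-on-write: the snapshot preserves a block's old
    contents only when that block is first overwritten after the snap."""

    def __init__(self, volume):
        self.volume = volume   # live block -> data map (shared, mutable)
        self.saved = {}        # pre-snap contents of blocks changed since the snap

    def write(self, block, data):
        if block not in self.saved:
            # First write to this block since the snap: save the old data once.
            self.saved[block] = self.volume.get(block)
        self.volume[block] = data

    def restore(self):
        """Roll the volume back to its state at snapshot time."""
        for block, old in self.saved.items():
            if old is None:
                self.volume.pop(block, None)   # block didn't exist at snap time
            else:
                self.volume[block] = old
        self.saved.clear()
```

Because only changed blocks are preserved, the snapshot consumes little space; but a full restore depends on the base volume plus the saved blocks, which is why complete recovery from copy-on-write snaps needs the whole chain available.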

Snapshots are easy to install and operate. A copy-on-write snapshot provides a short RPO but a relatively long RTO (data must still be recovered before it can be used). Split-mirror snapshots have a relatively long RPO, but they speed data recovery (RTO), duplication and data archival. One important benefit of split-mirror snapshots is that it's possible to access data offline for tasks such as data mining and offline production data testing. Some snapshot applications provide continuous snapshots and point-in-time rollback capabilities, which offer a faster RTO.

A split-mirror snapshot uses a lot of system resources and will degrade the performance of the platform it's running on while it creates the snapshot. And snapshots can't prevent a rolling disaster of snapping corrupt data.

Sorting out disaster recovery options
A worksheet comparing various disaster recovery solutions on the market is available as a PDF download.

DR hardware platforms
There are four principal hardware delivery platforms: the storage array, the general-purpose server, the purpose-built storage appliance and the intelligent storage networking switch. The storage array is a purpose-built storage server for block- or file-based storage. Many storage vendors provide optional storage array DR software, which includes synchronous and asynchronous remote mirroring and snapshot. These software products are typically specific to the individual vendor and its storage offerings.

Storage array-based software usually doesn't require application server agents. The arrays are server operating system-agnostic and the DR applications run fast. They are also installed in thousands of locations, and are proven and mature.

However, the array DR applications don't work with heterogeneous storage. In general, they don't have file-level or application awareness. (Array applications with application awareness use agents.) Storage array IOPS and throughput decline while DR applications are running. And these DR applications are licensed and managed on a per-array basis. Storage array DR applications have some of the highest TCOs and, in some cases, consume more raw storage than non-array based alternatives.

General-purpose servers have very low acquisition costs and low TCO. Implementing, servicing and managing them are known quantities. Performance is tunable and DR application performance leverages ongoing improvements in server technology. Increasing performance or scalability may be as simple as buying the next-larger server, and more memory and processing power. Other advantages include support for heterogeneous storage, and application and file-system awareness. General-purpose servers require DR application agents.

The purpose-built storage appliance is nothing more than a server optimized for DR applications. A good way to think of the purpose-built storage appliance is to view it as a networked storage controller. It leverages technologies specifically optimized for storage DR applications. Optimization includes I/O performance, throughput, scalability and high availability (no single point of failure). TCO is definitely lower than for the storage array or the intelligent storage networking switch, but the purpose-built appliance is proprietary. It may also have higher initial acquisition costs and may not keep up with server technology advances.

The intelligent storage networking switch is a relatively new DR delivery platform. The storage area network (SAN) switch is the ideal system to provide DR applications because it sits between application servers and their target storage, and it also has visibility into all servers and storage targets.

There are two principal types of intelligent storage-network switches. The first essentially integrates the purpose-built storage appliance as a server blade into a Fibre Channel SAN switch or director. The second packages it as a storage software delivery platform that just happens to use switching as part of its architecture. It leverages a new technology called split path acceleration of independent data streams (SPAID). SPAID improves performance by separating the control path (the slow path) from the data path (the fast path). It enables out-of-band virtualization without requiring server agents and runs most DR software applications without any changes. Initial costs and TCO will probably be much higher than for non-integrated systems.

No other platform has the DR application performance potential of the SPAID intelligent storage networking switch. SPAID switches have an inherently higher level of reliability, availability and serviceability than storage appliances because of the separation of control path from data path. Unfortunately, there are only a small handful of products that use the SPAID architecture. These include software from Incipient Inc., Maranti Networks, StoreAge Networking Technologies, Troika Networks Inc. and Veritas Software Corp. Of these, only StoreAge has a comprehensive suite of DR applications that works with all of the SPAID intelligent storage networking switches. Maranti has its own suite of DR applications, and Troika is working on a suite with tie-ins to other software-based DR applications. Incipient and Veritas are currently limited to volume management only.

Remember, a cost-effective DR strategy requires a mix of DR applications running on several platforms. Managing cost and effectiveness requires matching the value of the data to specific DR capabilities. This mix-and-match approach will reduce overall DR cost while meeting the organization's needs (see Sorting out disaster recovery options). Of course, this process must be repeated periodically to re-evaluate new technologies, products, SLA requirements and compliance regulations.
