Managing and protecting all enterprise data


Disaster recovery relief

The cost of disaster recovery tools can be even more than the value of the data that these very tools are supposed to be protecting. Fortunately, newer approaches to DR are restoring sanity to this high-pressure task.

Problems with disk-to-disk split mirrors
The limited protection and recoverability of point-in-time snapshots
  • The time between snapshots increases the data risk exposure
  • More snapshots reduce exposure, but also result in inefficient use of additional storage (requiring two to nine times the original storage space)
Storage vendor array lock-in
  • Must use the same--or a more expensive--storage array as the remote replication or snapshot catcher
The performance across a TCP/IP WAN
  • TCP/IP WAN issues (congestion, BER, jitter, latency) can reduce throughput performance by one to two orders of magnitude, reducing backup window success
Everyone is aware that in an era of natural disasters, terrorism and regulations, disaster recovery (DR) and business continuity processes are essential. But the high cost of DR and business continuity tools prohibits many organizations from implementing them. In fact, the cost of these tools is often greater than the value of the data that's being protected. Expensive DR tools put you in the precarious position of having to provide enough DR and business continuity to satisfy regulations, but not having enough to meet your real needs.

Compounding the problem is that many storage administrators don't know how to effectively implement traditional DR and business continuity solutions. Their organizational policy may be to back up to tape once a night, and perhaps even have that tape picked up by a service every day. Of course, there's a good chance that the operator didn't change the tape when the job said it was full, overwriting previous data. Fortunately, numerous storage vendors are starting to develop simpler, innovative and more cost-effective DR and business continuity products.

DR pain threshold
Cost, complexity and effectiveness all contribute to how high or low the pain threshold is for a specific DR solution.

Cost: The most common complaint about DR practices is that they require significant capital and operating investment. Disk-to-disk or split mirror products require the capital expense (CapEx) of a complete duplication of storage. When multiple split mirrors or point-in-time snapshots are implemented (such as with EMC Corp.'s Symmetrix Remote Data Facility and TimeFinder), the CapEx multiplies with each additional copy. Add in the ongoing operating expenses (OpEx) of software licenses, maintenance, personnel time, training and network, and the costs can quickly overwhelm the value of the protected data.

Even traditional tape backup products can prove to be far more costly than the data they protect. The CapEx (tape drives, automated tape libraries and tape media) may be lower than for disk-to-disk solutions, yet the OpEx is significantly greater. This is because the number of human errors, failed backups, gaps in the data, lost tapes, broken tapes and missed backup windows is typically much higher than is generally acknowledged.

Complexity: If cost is the No. 1 DR and business continuity complaint, complexity is a close second. A survey of more than 200 enterprise and small- to medium-sized businesses conducted last year by Dragon Slayer Consulting, Beaverton, OR, revealed a startlingly high level of user-perceived complexity and frustration with tape backup and recovery (see "DR is too complex"). There were similar levels of frustration (53%) with disk-to-disk split mirroring.

Effectiveness: Effectiveness defines how well a goal or task is completed. For DR and business continuity, this breaks down into two main points: the effectiveness of the data protection and the effectiveness of data recovery.

There's a significant difference in the level of pain between traditional tape backup and recovery and disk-to-disk split mirrors. Only 37% of the tape users in the survey were satisfied--or somewhat satisfied--with the effectiveness of their solution, whereas the satisfaction figure for the disk-to-disk split mirror users was 68%.

DR is too complex
Rank your satisfaction with your tape backup and recovery system's complexity level:

Rank your satisfaction with your disk-to-disk split mirror system's complexity level:

Source: Dragon Slayer Consulting

New DR solutions
Many storage vendors are bringing innovative products to the market that aim to reduce the cost and complexity of DR and business continuity while increasing its efficiency. The most promising of these technologies are:

  • Continuous replication and continuous snapshot
  • Server-to-server storage volume replication
  • TCP/IP WAN acceleration for storage applications
  • Virtual tape automation
Continuous replication and snapshot: Continuous replication and snapshot delivers on the promise of split mirrors and replication without most of the expense or technical deficiencies. It's ideal for database management systems (DBMS) and for the common corruption events that occur. The key to continuous replication/snapshot technology is that it adds time as a stored dimension. Recoveries of data can occur from any point in time, eliminating the risk exposure between point-in-time snapshots. Continuous replication/snapshot saves more than half of the costs of traditional disk-to-disk split mirrors by only replicating the changed data. This reduces the disk storage requirements and the tape backup or archival by more than 50%.

Continuous replication and snapshot works simply and elegantly (see "Continuous replication appliance"). It's commonly deployed in a storage appliance (a remote access server) with no single point of failure on a storage area network (SAN). There are no internal hard disk drives in the appliance because it's designed to use the disk arrays on the SAN. It operates at the block level, and is able to protect applications that work on file systems, DBMS and even raw disk partitions. The appliance is configured as just another mirror to the volume managers and looks like a set of disks or LUNs.

When a DBMS corruption occurs, the DBA tells the appliance to provide a data image at a point in time before the corruption occurred. This is presented immediately as a set of volumes for the DBMS. It can then be tested to make sure that this view of the data is prior to the corruption. If it isn't, the DBA has the appliance provide the data image at an earlier point in time. This can be done repeatedly until a valid non-corrupted image is found.

This image then restores the data instantaneously to that point in time and stands in temporarily for the DBMS application while the appliance quietly resynchronizes the primary volumes with the restored data. Once resynchronized, the primary volumes are automatically relinked with the DBMS and continuous replication/snapshot resumes. Recovery is quick and painless.
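The recovery loop described above can be illustrated with a toy model. This is a minimal sketch, not any vendor's design: the `BlockJournal` class, the `latest_clean_image` helper and the validity check are all hypothetical names introduced here, and the journal is assumed to receive writes in timestamp order.

```python
from bisect import bisect_right

class BlockJournal:
    """Toy time-indexed write journal (hypothetical, for illustration only)."""

    def __init__(self):
        self._times = []    # write timestamps, kept in ascending order
        self._writes = []   # (block_number, data), parallel to _times

    def record(self, timestamp, block, data):
        # Assumption: writes are recorded in the order they were issued.
        if self._times and timestamp < self._times[-1]:
            raise ValueError("writes must be recorded in time order")
        self._times.append(timestamp)
        self._writes.append((block, data))

    def image_at(self, timestamp):
        """Return {block: data} as the volume looked at the given time."""
        image = {}
        upto = bisect_right(self._times, timestamp)
        for block, data in self._writes[:upto]:
            image[block] = data   # later writes overwrite earlier ones
        return image

def latest_clean_image(journal, candidate_times, is_valid):
    """Step back through candidate times until is_valid(image) passes."""
    for t in sorted(candidate_times, reverse=True):
        image = journal.image_at(t)
        if is_valid(image):
            return t, image
    return None

journal = BlockJournal()
journal.record(100, 0, "good-a")
journal.record(200, 1, "good-b")
journal.record(300, 0, "CORRUPT")   # the corruption event

result = latest_clean_image(
    journal, [350, 250, 150], lambda img: "CORRUPT" not in img.values()
)
print(result)   # (250, {0: 'good-a', 1: 'good-b'})
```

Because every write carries a timestamp, any earlier image can be rebuilt on demand, which is what removes the gap between fixed snapshots.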

Revivio Inc.'s Continuous Protection System (CPS) and Alacritus Software's Chronospan offer continuous replication/snapshot. EMC Corp., Hewlett-Packard Co. (HP) and others are working on their own versions of this technology.

One real financial benefit to this technology is that it allows expensive storage to be protected by inexpensive storage from the same or different vendors. It also frees up many of the split mirror volumes and makes them available as primary storage.

Although continuous replication/snapshot reduces DR and business continuity costs and complexity, there's a downside to this technology: limited distance. It's primarily a campus solution and doesn't yet work across long WAN distances. This means out-of-region disasters require other solutions.

Server-to-server storage volume replication: This technology is beginning to emerge as the 21st century's replacement for backup and recovery. It is software installed on each server hosting an application with critical data requiring protection. It's also installed on a central or remote server that acts as the "catcher." Think of it as a hub and spoke arrangement where the catcher is the hub.

Server-to-server storage volume replication works by replicating live disk-based data from each server to the catcher across any available TCP/IP network. The replicator duplicates data in near real time while preserving the original write order, to assure integrity between the two systems in the event of a disaster. The replicated data, frozen at a specific point in time, can be made available for reading to applications on the catcher system. Should a failure occur on the primary system, the catcher system can provide immediate access to contemporary business-critical data.
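Why write order matters can be shown with a small sketch. It assumes the sender tags each write with a sequence number, which is one common way to preserve ordering but is not confirmed for any product named in this article. The `Catcher` class and its methods are hypothetical.

```python
import heapq

class Catcher:
    """Applies replicated writes strictly in the order the source issued them."""

    def __init__(self):
        self.volume = {}      # block number -> data
        self._next_seq = 0    # next sequence number allowed to apply
        self._pending = []    # min-heap of writes that arrived early

    def receive(self, seq, block, data):
        heapq.heappush(self._pending, (seq, block, data))
        # Apply every write that is now contiguous with what was applied.
        while self._pending and self._pending[0][0] == self._next_seq:
            _, blk, payload = heapq.heappop(self._pending)
            self.volume[blk] = payload
            self._next_seq += 1

catcher = Catcher()
# The network delivers write 1 before write 0; both touch block 7.
catcher.receive(1, 7, "new")
held_back = dict(catcher.volume)   # nothing applied yet
catcher.receive(0, 7, "old")
print(held_back, catcher.volume)   # {} {7: 'new'}
```

Without the ordering check, the late-arriving older write would overwrite the newer one and the catcher's copy would no longer match the primary.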

Continuous replication appliance

With a continuous replication appliance, write data is time-stamped and the appliance is always ready to instantly restore to any time and any place.

One key aspect to this technology is the ability to control all of the servers from a central console. No IT personnel are required at the server, allowing for "lights out" DR and business continuity at remote sites. Some vendors (such as Constant Data Inc. and Fujitsu Softek) even include continuous replication functionality. Again, allowing continuous replication of only the data that changes reduces the amount of disk and tape storage required.

Server-to-server storage volume replication products include Constant Data Constant Replicator, EMC/Legato Replicator, NSI DoubleTake, Softek Replicator and Veritas Volume Replicator; and still others are emerging. Vendor operating system support varies, but most support Windows and Linux.

Server-to-server storage volume replication is a simple, low cost and effective DR and business continuity solution. It works with direct-attached storage (DAS), network-attached storage (NAS) and SAN storage, regardless of the vendor. It scales from the small to medium business to the enterprise customer. And it allows replication from high-cost storage to low-cost storage.

There are two downsides to this solution: license fees that grow with the number of servers and applications requiring data protection, and dismal throughput over long-haul TCP/IP networks caused by congestion, bit error rates (BERs), jitter and latency. The first can be handled through vendor negotiation. The second is a more difficult issue and may require the implementation of TCP/IP WAN accelerators.

TCP/IP WAN accelerators: Congestion, BERs, jitter and TCP/IP latency all cause throughput degradation in a TCP/IP network. Add longer distance, and the degradation can be so extreme that the typical throughput of a DS3 (rated at 45Mb/s) is approximately 5Mb/s, or 625KB/s. This means that a tiny 30GB replication or backup can't be completed within an eight-hour window--it would take at least 13 hours and 20 minutes to complete. Even if the bandwidth is increased to OC3 (155Mb/s), or more than three times a DS3, the window still can't be met. The throughput doesn't increase anywhere near the same percentage as the bandwidth. Depending once again on the congestion, BER, jitter, latency and distance, it may barely increase at all. Historically, the throughput increase can be expected to be less than 50%. Even with this increase, the eight-hour window is still missed.
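The arithmetic behind those figures is easy to check. The sketch below assumes decimal units (1GB = 1,000MB) and takes the article's 5Mb/s effective DS3 throughput as given. The 7.5Mb/s OC3 figure is simply that rate raised by the 50% upper bound and is not a measured value.

```python
def transfer_hours(size_gb: float, throughput_mbps: float) -> float:
    """Hours needed to move size_gb gigabytes at throughput_mbps megabits/s."""
    size_megabits = size_gb * 1000 * 8          # GB -> MB -> megabits
    seconds = size_megabits / throughput_mbps
    return seconds / 3600

WINDOW_HOURS = 8

ds3_effective = 5.0                   # Mb/s, effective DS3 throughput cited above
oc3_effective = ds3_effective * 1.5   # Mb/s, assumed best case (+50%)

for label, rate in [("DS3", ds3_effective), ("OC3", oc3_effective)]:
    hours = transfer_hours(30, rate)
    print(f"{label}: {hours:.2f} h, fits window: {hours <= WINDOW_HOURS}")
# DS3: 13.33 h, fits window: False
# OC3: 8.89 h, fits window: False
```

Even in the best case, the 30GB job overruns the eight-hour window by almost an hour.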

TCP/IP WAN accelerators for replication

TCP/IP WAN accelerators used for replication boost performance by three to 400 times the normal network speeds by shielding against bit error rates, jitter and TCP latency.

A number of vendors have enthusiastically attacked this problem and are now delivering cost-effective products. Each vendor's technology is innovative and varied. Technologies employed include packet reordering, performance-enhancing proxies, compression, QoS and sophisticated flow control. The net results are impressive, with throughput increases of between three and 400 times the normal network speeds. (See "TCP/IP WAN accelerators for replication".)

The way TCP/IP accelerators work for DR and business continuity is simple, and requires nominal skills. First, an appliance is placed as a gateway or proxy on the LAN with the replication sender. Then another appliance is placed on the LAN at the location of the replication catcher. The appliances intercept and transmit all packets between the sender and catcher.
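Of the techniques listed earlier, compression is the simplest to illustrate with a paired-appliance arrangement. The following is only a sketch: the two function names are hypothetical, the sample payload is artificially repetitive, and real products combine compression with the other techniques in ways this does not attempt to model.

```python
import zlib

def sender_appliance(payload: bytes) -> bytes:
    """Sender-side appliance: compress data before it crosses the WAN."""
    return zlib.compress(payload, 6)

def catcher_appliance(wire_bytes: bytes) -> bytes:
    """Catcher-side appliance: restore the original bytes for the catcher."""
    return zlib.decompress(wire_bytes)

# Hypothetical, highly repetitive replication traffic.
block = b"replicated database page " * 400
on_wire = sender_appliance(block)

# The catcher sees exactly what the sender wrote.
assert catcher_appliance(on_wire) == block
print(f"{len(block)} bytes sent as {len(on_wire)} bytes on the WAN")
```

The sender and catcher are unaware of the appliances. Fewer bytes cross the slow link, so more application data fits through the same effective throughput.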

The products can be divided into two categories: those that work up to and above DS3 bandwidth (up to OC3, OC12, etc.) and those that only work up to DS3. Vendors in the "up to or greater than DS3" category include NetEx Software's HyperIP and Orbital Data's IP Express. Vendors in the DS3 and below category include Expand Network's IP Accelerator, ITWorx's NetCelera, Peribit's Sequence Reducers and Riverbed's Steelhead.

The one disadvantage to these solutions is the potential cost of putting appliances at a large number of remote sites. However, the savings in bandwidth costs and the increased ability to hit DR and business continuity windows usually offset the additional appliance and software license costs.

Virtual tape automation: Virtual tape has been around for years. Its success has been mostly in the mainframe market, where it's used as a nearline storage device that manages less-frequently needed data. Virtual tape appears to be stored entirely on tape cartridges, when some parts of it may actually be located in faster disk storage. By doing this, virtual tape makes tape more efficient and faster.

This never really caught on with the Linux, Unix and Windows markets because in those markets, tape is used primarily as backup and archival media. Automation makes virtual tape a low-cost methodology to increase the performance of DR and business continuity while lowering their TCO.

Automation works by making low-cost disk--usually parallel or Serial ATA--look like an automated tape library to the tape backup solution. This increases the backup speed to the speed of the disk array. From there, the backup data from one or multiple systems is written to tape automatically. Recoveries are faster because the data is organized more efficiently on the tapes and is easier to locate. (See "Automated virtual tape".)
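The staging flow can be sketched as a two-step model: backups land on disk at disk speed, then a later pass copies them to tape. Everything here is hypothetical, including the `VirtualTapeLibrary` class. Grouping data by source system during the copy to tape is an assumption chosen to illustrate how tape layout could be made more efficient, not a description of any listed product.

```python
from collections import defaultdict

class VirtualTapeLibrary:
    """Disk cache that the backup application treats as a tape library."""

    def __init__(self):
        self._disk_cache = []   # (source_system, chunk) in arrival order
        self.tape = []          # what ends up on physical tape

    def write(self, source, chunk):
        """The backup application's 'tape write': lands on disk."""
        self._disk_cache.append((source, chunk))

    def destage(self):
        """Copy cached data to tape, keeping each system's chunks together."""
        by_source = defaultdict(list)
        for source, chunk in self._disk_cache:
            by_source[source].append(chunk)
        for source in sorted(by_source):
            self.tape.append((source, by_source[source]))
        self._disk_cache.clear()

vtl = VirtualTapeLibrary()
# Two servers back up at the same time, so their writes interleave.
for source, chunk in [("mail", "m1"), ("db", "d1"), ("mail", "m2"), ("db", "d2")]:
    vtl.write(source, chunk)
vtl.destage()
print(vtl.tape)   # [('db', ['d1', 'd2']), ('mail', ['m1', 'm2'])]
```

With each system's data contiguous on tape, a restore for one server reads one region rather than seeking across interleaved streams.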

Automated virtual tape

This solution reduces the cost and complexity and increases the effectiveness of local DR and business continuity solutions. Products include ADIC's Pathlight VX: Integrated Disk-to-Tape Backup solution and Maxxan with FalconStor IPStor 4.0 Virtual Tape running within its MXV320 and SVT 100 intelligent switches. The ADIC solution includes the SATA array and the tape library. The Maxxan solution requires a separate disk array and tape library. Neither addresses the TCP/IP bandwidth issues previously discussed when backup or replication must be carried out over long distances.

Leveraging these new solutions will require time, planning and effort. None of these products is a panacea. The applicability of any solution will vary by organization. Curing the pain may require implementing multiple solutions that solve different pieces of the puzzle and work together.

How to proceed
Before evaluating, implementing or deploying any of these or other products, it's necessary to conduct an internal audit of your storage environment. Identify the sources of your disaster recovery pain. This may require the implementation of a storage resource management (SRM) tool. Once the pain is diagnosed, determine which product or products will eliminate or significantly reduce that pain. Design an evaluation pilot program to provide quantifiable benchmarks to measure against expectations. If the benchmarks prove to meet or exceed expectations, develop a sane deployment rollout program. Make sure there are measurable milestones and benchmarks along the way. Don't be afraid to require the selected vendors to provide SLAs. The SLAs should provide guarantees and financial penalties for failure to achieve those results. This process is not for the faint of heart; however, the relief of chronic pain is well worth the effort.
