Real-world DR

Disaster recovery (DR) is an ongoing issue for every storage shop. Building a plan that works and is cost-effective is a challenge. Here's how six companies met their DR challenges.

Backing up to tape
Hartz Mountain Industries Inc., a real estate company based in Secaucus, NJ, initially developed its disaster recovery (DR) plan working with Gartner Inc. Gartner assisted Hartz Mountain with business planning and process design. The result was a basic DR program in which data from satellite offices is replicated across the WAN to headquarters, where it's backed up to tape each night along with the headquarters' data--a total of about 1TB of data--and then shipped offsite. Using 10 DLT tape drives, the company backs up the entire 1TB of data in 4.5 hours.
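The backup figures above imply an aggregate and a per-drive throughput that can be checked with quick arithmetic. The sketch below assumes decimal units (1TB = 1,000,000MB) and an even spread of the load across all 10 drives; neither assumption comes from Hartz Mountain.

```python
def backup_throughput_mb_per_sec(total_tb, hours, drives):
    """Return (aggregate, per-drive) throughput in MB/sec.

    Assumes decimal units and an evenly balanced load across drives.
    """
    total_mb = total_tb * 1_000_000
    seconds = hours * 3600
    aggregate = total_mb / seconds
    return aggregate, aggregate / drives


# Figures from the sidebar: 1TB in 4.5 hours on 10 DLT drives.
aggregate, per_drive = backup_throughput_mb_per_sec(1.0, 4.5, 10)
print(f"aggregate: {aggregate:.1f} MB/sec, per drive: {per_drive:.1f} MB/sec")
```

Under those assumptions the nightly job needs roughly 62MB/sec in total, or about 6MB/sec per drive.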

In the last several years, e-mail has emerged as Hartz Mountain's primary recovery target followed by the usual corporate databases, particularly billing and accounts receivable. Still, the company backs up just about everything, relying on point-in-time copies, replication and dumping to tape to ensure that it can recover in the event of a disaster. Since the initial Gartner planning, Hartz Mountain hasn't engaged in formal business impact analysis. With the exception of the heightened awareness of the importance of e-mail, the program has remained pretty much unchanged in the last several years, reports Jeremy Kahn, assistant vice president of information technology.

At the beginning, the company tested the DR system regularly by assigning logical unit numbers to a server and shutting it off, a manual process. In each test, Hartz Mountain was able to successfully bring back the data from point-in-time copies or tape. In more recent years, the company has dispensed with formal DR testing. Looking ahead, Kahn would like to mirror the data rather than replicate it, but the economics aren't quite there yet. He's waiting until the cost of mirroring comes down a bit further.

To make the backup work, Hartz Mountain uses FalconStor for replication and as a storage router. Replicated copies are backed up to DLT tape each night using Veritas Backup Exec software. Tapes are then shipped offsite.

-Alan Radding

Storage administrators often ask what their peers are doing to solve the same disaster recovery (DR) problems they face. What they really want to know is what's working and what's cost-effective.

Information of this type is usually difficult to make public. Many IT organizations have a policy restricting employees from speaking to the press or others about these issues.

Companies often don't want to bring attention to their problems, even if they've been resolved. To bring you as candid a report as possible, we've withheld some company names in this article. The pertinent facts remain unchanged, however.

Users are also typically tight-lipped about how much they actually paid for a DR solution. Therefore, the "cost to implement" numbers that follow are based on approximate MSRP and estimated operating expenses (OpEx) for the time invested, although all of the companies report that they paid considerably less than the sticker price.

The following case studies have a central theme: Increasing data levels and stricter compliance regulations are forcing companies to look to newer technologies to solve their growing DR and backup pains. The old DR paradigm of backing up to tape and driving the tapes offsite is broken. Companies are increasingly replicating data over high-speed WANs and using incremental backup technologies to meet their service level agreements (SLAs) and backup windows, and to protect their critical data less expensively. Of course, restores are much easier, too.

Compliance crisis
A multinational bank's storage requirements for its worldwide operations and hundreds of distributed locations have been growing at an increasing rate. This put intolerable pressure on its backup/restore systems. The firm's NetBackup and ARCserve systems, from Veritas Software Corp. and Computer Associates (CA) International Inc., respectively, were unable to meet its needs, and the problems were becoming progressively worse. Backup windows were missed more often than they were met. Simple restores of a single file were taking between five and 10 hours; SQL database restores were taking days to complete. The bank estimated that only about 70% of its data was protected and recoverable.

These issues led to other problems. Software patch windows were missed because of the time required for backups. Tapes were trucked offsite to a tape-vault service provider, but if a backup window was missed, incomplete backup tapes were shipped offsite.

In addition, backup data security was essentially non-existent. Tape data wasn't encrypted, so anyone could copy/load a tape and have access to highly confidential user information and records--a serious liability as evidenced recently by another big bank's well-publicized loss of tapes containing 1.5 million user records.

It was obvious that the bank would be in serious trouble if it ever had to recover from a disaster. "This was a nightmare of epic proportions," said one bank executive. "And one that was not without its consequences." In an age of regulatory compliance, failure to protect the bank's data and recover it in an adequate period of time put the bank at risk of severe financial penalties. The situation had to be fixed quickly.

To correct its numerous problems, the bank first determined what it needed and then prioritized what to fix first. During the product research phase, storage administrators studied and evaluated products from all the top brand-name vendors and a few startup companies. After much scrutiny, the bank zeroed in on the only solution it felt met its requirements. Backup My Info! Inc., a backup service provider/VAR based in Florida and New York, recommended Televaulting backup-to-disk software from Asigra Inc., a 19-year-old Toronto software firm (see "Pay-as-you-go remote backup"). To ensure data would never be outside its control, the bank elected to license Televaulting from Backup My Info! instead of purchasing the backup/restore service.

Testing disaster recovery
Detroit-based Blue Cross Blue Shield of Michigan (BCBSM) tests its DR plan at least once, and usually twice, each year, reports Tim Kavanagh, process specialist at the health insurer. It runs a series of exercises that test the restoration of the critical data required to support all key applications, regardless of platform. It also tests the execution of online and batch processing of a daily cycle of the applications. Print and mail capabilities are also tested by sending data from the recovery site (an IBM facility in Boulder, CO) to BCBSM's print and mail recovery vendor in Warminster, PA. BCBSM's disaster test typically runs for four days and includes more than 15 groups within IT in addition to the application and user groups, with each group following specifically developed test plans. Tests may also involve the participation of more than 100 individuals.

To determine which data to protect, BCBSM periodically conducts a formal risk assessment and business impact analysis. Based on the results of that analysis, it identifies the applications, systems and data that need protection. Currently, the organization protects the critical data and infrastructure required to resume processing of its critical applications and business functions. These include claims processing, payroll and financial, accounts receivable, drugs, interplan teleprocessing services, membership and customer services, and print and mail services. BCBSM is performing a new risk assessment and business impact analysis this year. When the study is completed, the organization will conduct a gap analysis to determine if and where there are gaps in its disaster recovery plans; it will then modify its DR strategy appropriately.

Key to BCBSM's disaster recovery is the use of DR Manager from Softek Storage Solutions Corp. DR Manager tracks all jobs related to a given application, analyzes the job stream and defines the backup profile automatically. It also automatically generates the correct aggregate data sets corresponding to each application.

-Alan Radding

Asigra's Televaulting has only two main backup/restore components, DS-System and DS-Client. The DS-System software is the centralized repository for backups. This is how the bank deployed it: The DS-System contains the entire set of compressed and encrypted generations of incremental-forever backups from each remote location or laptop. All of the backed-up data is stored as content-addressable storage. This means all of the compressed and encrypted files and/or data can be restored with a high degree of granularity. Individual files, complete volumes, database tables, complete databases or even bare metal can be restored from any of the backup generations. DS-System software runs on standard (Linux, Windows, Solaris or VMware) servers without any special hardware.
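The idea of content-addressable storage can be illustrated with a few lines of code. The sketch below is not Asigra's design: the `ContentStore` class, the use of SHA-256 as the address and zlib for compression are all assumptions made for illustration, and encryption is left out entirely.

```python
import hashlib
import zlib


class ContentStore:
    """Toy content-addressable store: data is keyed by the hash of its content."""

    def __init__(self):
        self._blobs = {}  # digest -> compressed bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical content always hashes to the same address,
        # so it is stored only once no matter how often it is backed up.
        if digest not in self._blobs:
            self._blobs[digest] = zlib.compress(data)
        return digest

    def get(self, digest: str) -> bytes:
        return zlib.decompress(self._blobs[digest])

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
# Each backup generation is just a map of file name -> content address.
generation_1 = {"report.doc": store.put(b"quarterly numbers v1")}
generation_2 = {"report.doc": store.put(b"quarterly numbers v2"),
                "copy-of-report.doc": store.put(b"quarterly numbers v1")}

# Any generation can be restored by looking up its addresses.
print(store.get(generation_1["report.doc"]))
print(len(store))  # three files backed up, two unique blobs stored
```

Because every generation only records addresses, restoring an old version of a single file is a lookup rather than a replay of a tape set.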

The DS-Client software is installed at remote sites. This part of the application collects the data to be backed up at the remote location from all of the target application, file, mail and database servers, as well as any included desktop and laptop PCs. The DS-Client maintains only the latest version of each backup; restores of that generation can be made locally without having to access the DS-System.

Multiple DS-Clients can transmit to the same central DS-System. DS-Client backup targets don't have agents; DS-Client uses standard APIs and existing security credentials to remotely log into all backup targets to capture relevant application data and securely manage the transfer to the DS-System.

The DS-Client maintains all current sets of permissions and doesn't require turning the backup targets into shared or mapped drives. It transmits all the data from first-time backups in a compressed and encrypted format to the central DS-System over a TCP/IP connection. It applies AES or DES encryption to backup data in flight and at rest in the DS-System repository. All subsequent backups of a target eliminate redundant (or common) files and back up incrementally, sending only the changed blocks (delta blocking). The net effect is that the bandwidth required at each remote site is measurably reduced, which is an important cost consideration for any distributed backup/restore program.
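A block-level incremental backup of the kind described above can be sketched as follows. This is a generic illustration rather than Televaulting's algorithm: fixed 4KB blocks, SHA-256 digests and the function names are assumptions, and the sketch does not handle files that shrink.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen for illustration


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of the previously backed-up version."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def changed_blocks(previous: bytes, current: bytes,
                   block_size: int = BLOCK_SIZE) -> dict:
    """Return {block index: block bytes} for blocks that are new or different."""
    old = block_digests(previous, block_size)
    delta = {}
    for index, offset in enumerate(range(0, len(current), block_size)):
        block = current[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(old) or old[index] != digest:
            delta[index] = block  # only these blocks cross the WAN
    return delta


previous = b"a" * BLOCK_SIZE * 3
current = previous[:BLOCK_SIZE] + b"b" * BLOCK_SIZE + previous[2 * BLOCK_SIZE:]
delta = changed_blocks(previous, current)
print(sorted(delta))  # indexes of the blocks that must be sent
```

In this example only one of three blocks changed, so only one-third of the file would be transmitted on the second backup.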

Cost was another key factor for the bank. DS-Client licenses are free, while the DS-System is licensed on a "pay-as-you-grow" basis based on the compressed backup capacity stored at the central location and additional advanced features. Essentially, the software is licensed the same way disk is purchased. The bank paid $90,000 MSRP with an ongoing OpEx of $27,000.

Of course, there's no such thing as a perfect implementation. Shortly after installing DS-Client, the bank discovered there wasn't enough memory in the system to collect the data. Doubling the RAM solved the problem.

The bank has deployed Televaulting on the majority of its servers and most of its desktops. It plans to roll it out to its remaining desktops and laptops. The results to date:

  • Backup windows are no longer missed.
  • Bandwidth requirements and WAN costs have declined by as much as 80%.
  • Restores of individual files and SQL databases are completed in minutes.
  • IT resources have been freed up.
  • The reduction in DR costs has already paid for the solution.
  • DR compliance is no longer a worry.

Bank reins in remote sites
A multinational bank uses Asigra Televaulting for distributed backup and restore to disk.

An e-ICP's DR problem
Our second company is Broadview Networks Inc., a New York City-based electronically integrated communications provider (e-ICP). Broadview provides integrated communications solutions (including voice services, data services, dial-up and high-speed Internet services) to businesses in the northeastern and mid-Atlantic states.

It was using EMC Corp.'s Symmetrix Remote Data Facility (SRDF)/Adaptive Copy over the EMC Gigabit Ethernet Director for DR replication between its two primary data centers. These data centers are separated by approximately 10 milliseconds of roundtrip latency. The general rule of thumb in converting circuit latency into distance is that a millisecond of latency equals approximately 100 miles.

EMC and the e-ICP calculated they could meet the DR replication requirements with a 24Mb/sec fractional DS3 private virtual circuit (PVC).

Unfortunately, they were wrong. Actual measurements showed a best-case effective data throughput of a miserly 17Mb/sec (and as low as 12Mb/sec). To make matters worse, replication requirements were increasing; Broadview wasn't meeting its backup windows and data wasn't being protected. The company determined it now needed effective data throughput of at least 28Mb/sec to meet its backup windows.

Broadview would have to increase the bandwidth allocation to its EMC DR application to at least a whole DS3 and possibly part of another. Even then, there were no guarantees the additional bandwidth would fix the effective data throughput problem, and projected additional bandwidth operating costs were high. Broadview even considered replacing the entire EMC DR solution, but elected not to do that when it realized the throughput issue centered on TCP/IP's fickleness.

EMC's solution was to bring in the HyperIP TCP storage replication accelerator from Network Executive (NetEx) Software Inc., Maple Grove, MN. The HyperIP software runs on a standard Lintel appliance provided by NetEx. HyperIP is usually deployed in matched pairs (although it can be deployed in a many-to-one configuration) and for critical DR, in an active-active fully redundant, highly available configuration. It can be set up as a simple TCP gateway or proxy. Broadview set it up as a gateway.

HyperIP takes in TCP/IP packets from the application over a gigabit Ethernet (GbE) adapter and converts them to an efficient, alternative transport delivery mechanism between appliances. In doing so, it receives the optimized buffers from the local application and delivers them to the destination appliance for subsequent delivery to the remote application process. HyperIP is licensed on a "pay-as-you-grow" basis based on the amount of throttled bandwidth.

HyperIP tracks the acknowledgements of data and the resending of buffers; its flow-control mechanism on each connection optimizes the performance of the connection to match available bandwidth and network capacity. Because it uses a more efficient transport protocol than TCP/IP, it dramatically lowers overhead. In addition, it dynamically adjusts window size from 2KB to 256KB, allowing optimal replication performance. The result is essentially zero TCP latency and considerable congestion avoidance. The entire HyperIP transport is completely transparent to the storage replication application.
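Why window size matters over a 10-millisecond link can be shown with a general property of windowed transports: a sender can have at most one window of data in flight per round trip. The sketch below illustrates that bound, not HyperIP's internals; it assumes 1KB = 1,024 bytes and uses Broadview's 10ms round-trip latency and 24Mb/sec circuit as inputs.

```python
def window_limited_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on throughput when one window is sent per round trip."""
    return (window_bytes * 8) / (rtt_ms / 1000) / 1_000_000


def bandwidth_delay_product_bytes(link_mbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep a link of this speed full."""
    return link_mbps * 1_000_000 * (rtt_ms / 1000) / 8


RTT_MS = 10  # roundtrip latency between Broadview's data centers

for window_kb in (2, 64, 256):
    limit = window_limited_mbps(window_kb * 1024, RTT_MS)
    print(f"{window_kb:>3}KB window -> at most {limit:.1f} Mb/sec")

needed = bandwidth_delay_product_bytes(24, RTT_MS)
print(f"window needed to fill 24Mb/sec: about {needed / 1024:.0f}KB")
```

At this latency a 2KB window caps a connection at under 2Mb/sec, while roughly 29KB in flight is enough to fill the 24Mb/sec circuit, which is why a transport that sizes its window dynamically can make far better use of the link.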

Bandwidth boost aids DR
NetEx HyperIP helped Broadview Networks Inc. overcome bandwidth issues that were stymieing its DR efforts.

A key challenge for storage replication applications running over TCP/IP is packet loss. Bit errors, jitter, router buffer overflows and the occasional misbehaving node can all cause packet loss, which is devastating to effective data throughput. Most networks have some packet loss, ranging from 0.01% to as high as 5%. Packet loss causes the TCP transport to retransmit packets, slow down the transmission of packets from a given source and re-enter slow start mode each time a packet is lost. This error-recovery process can cause effective throughput to drop to as low as 10% of the available bandwidth.
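The sensitivity of standard TCP to packet loss can be estimated with the widely used Mathis approximation, which bounds steady-state throughput at roughly (MSS / RTT) x (1.22 / sqrt(loss rate)). The sketch below is a rough model, not a measurement of Broadview's network: the 1,460-byte segment size is an assumption, and the formula ignores timeouts and other effects that make real results worse.

```python
import math


def mathis_throughput_mbps(mss_bytes: int, rtt_ms: float, loss_rate: float) -> float:
    """Approximate ceiling on a single TCP connection's throughput under loss."""
    rtt_s = rtt_ms / 1000
    bits_per_rtt = mss_bytes * 8
    return (bits_per_rtt / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate) / 1_000_000


def effective_mbps(link_mbps: float, mss_bytes: int, rtt_ms: float, loss_rate: float) -> float:
    """Throughput is capped by whichever is lower: the link or the loss bound."""
    return min(link_mbps, mathis_throughput_mbps(mss_bytes, rtt_ms, loss_rate))


MSS = 1460   # assumed segment size in bytes
RTT_MS = 10  # roundtrip latency cited for Broadview's link
LINK = 24    # Mb/sec PVC

for loss in (0.0001, 0.01, 0.05):
    rate = effective_mbps(LINK, MSS, RTT_MS, loss)
    print(f"loss {loss:.2%}: about {rate:.1f} Mb/sec per connection")
```

The model shows the direction of the effect: as loss climbs from 0.01% toward 5%, the per-connection ceiling falls well below the provisioned 24Mb/sec.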

HyperIP mitigates the effects of up to 5% packet loss by optimizing the blocks of data traversing the WAN, maintaining selective acknowledgements of the data buffers and resending only the buffers that didn't make it, not the whole frame. Packet loss for Broadview--although nominally in the 0.01% range--was having a negative impact on the EMC SRDF effective data throughput.

There were multiple issues with Broadview's implementation. First, because latency varied constantly, the HyperIP units had difficulty functioning correctly. Once the network settled down, an ATM router port failed. After that was corrected, the Symmetrix began having intermittent GbE port time-out issues because its firmware wasn't up to date. Once the firmware was updated, the ATM router port was fixed and the network stabilized, things ran smoothly. The implementation cost was $120,000 MSRP; the ongoing OpEx is approximately $18,000.

Broadview is thrilled with its HyperIP implementation. Its EMC SRDF/Adaptive Copy effective data throughput ranges from 60Mb/sec to 90Mb/sec on its 24Mb/sec PVC, averaging about 70Mb/sec. The plan is to aggregate other storage replication applications (such as Veritas Volume Replicator) through the HyperIP to take advantage of the additional "free" bandwidth.

State-of-the-art disaster recovery
MasterCard International, the global financial processing company, has been following a formal business continuity program since 1990. Over the last 15 years, it has continually evolved its program as needs and technology capabilities have changed, reports Randy Till, vice president of business continuity. Today the company operates two primary data centers and multiple secondary data centers. Its critical debit card operation is run concurrently at both primary facilities, a coprocessing strategy that provides 100% redundancy. In the event of a problem at one facility, the other continues processing with no discernible impact on the operation. For other key apps, MasterCard duplicates data from one location to the next as frequently as every hour. Eventually, most of its top critical applications will be protected through coprocessing. These systems will have to be rewritten to take advantage of the latest coprocessing technologies.

The company relies on business impact analyses to determine how to protect each system, and has established three tiers of protection. Tier 1 addresses the most critical systems. But even within tier 1, MasterCard has defined four levels of protection based on recovery point objectives. These subtiers are for immediate, zero to four hours, four to 24 hours, and one to three days recovery. Most problems, Till points out, are fully resolved within 72 hours and the organization resumes processing on its primary systems. The other tiers address longer term outages that might require more and different equipment.

The company tests its business continuity capabilities in April and October. Every tier-1 system is tested at least once each year. A test may include as many as 40 tier-1 systems. MasterCard strives for end-to-end testing that tests not only the tier-1 system, but all the other systems it depends on. Tests typically run almost a week, and involve 50 to 70 people. MasterCard is using EMC TimeFinder software for local storage replication and EMC Symmetrix Remote Data Facility software for remote replication.

-Alan Radding

Virtualizing storage arrays
For a large southern U.S. manufacturer, DR recently became a primary issue because of regulatory compliance. Before the regulations, the manufacturer thought of DR protection as nothing more than backing up to tape and shipping the tape offsite. But before it could implement a better DR system, it had to solve a difficult problem between its VMware servers and IBM Corp. ESS (Shark) storage systems.

To get around the Shark's inability to dynamically allocate storage to a logical unit number (LUN), the manufacturer carved its Sharks into 4GB LUNs. It then used the server's volume manager to aggregate LUNs as needed for applications. The workaround was successful until it was moved to a VMware-based environment.

The problem arose when the storage requirements on the VMware servers increased to 2TB. VMware is limited to 128 LUNs, which is more than sufficient in most circumstances. But the Shark workaround was now a roadblock: 128 LUNs multiplied by 4GB per LUN equals a maximum of 512GB--only one-quarter of the new requirement. The manufacturer could have carved the Shark up again, but that would have meant migrating all the data from the Shark, re-initializing and reformatting the Shark and then migrating the data back--a process that would have been disruptive and time-consuming. Another option would have been to rip out the VMware environment and to go back to one server image per platform, but that would have incurred dramatic disruptions and high costs.
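The LUN arithmetic above is easy to reproduce. The sketch below uses the figures from the text (128 LUNs, 4GB per LUN, a 2TB requirement) and assumes binary units, so 2TB = 2,048GB; the function names are illustrative.

```python
import math


def max_capacity_gb(lun_limit: int, lun_size_gb: int) -> int:
    """Largest capacity a server can address with a fixed LUN count and size."""
    return lun_limit * lun_size_gb


def min_lun_size_gb(required_gb: int, lun_limit: int) -> int:
    """Smallest uniform LUN size that reaches the requirement within the limit."""
    return math.ceil(required_gb / lun_limit)


LUN_LIMIT = 128          # VMware limit cited in the article
LUN_SIZE_GB = 4          # size of each Shark LUN
REQUIRED_GB = 2 * 1024   # 2TB, assuming binary units

ceiling = max_capacity_gb(LUN_LIMIT, LUN_SIZE_GB)
print(f"addressable: {ceiling}GB of {REQUIRED_GB}GB "
      f"({ceiling / REQUIRED_GB:.0%}), shortfall {REQUIRED_GB - ceiling}GB")
print(f"uniform LUN size needed: {min_lun_size_gb(REQUIRED_GB, LUN_LIMIT)}GB")
```

With the 128-LUN limit fixed, every LUN would have to be at least 16GB to reach 2TB, which is why the choice came down to re-carving the arrays or presenting larger virtual LUNs.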

The company thought SAN fabric-based virtualization might solve its primary problem and, as a side benefit, provide a cost-effective DR solution as well. It evaluated solutions from DataCore Software Corp., FalconStor Software Inc., IBM and Troika Networks Inc. It settled on Troika's Accelera and SAN Volume Suite (SVS), which includes Troika VMware multipathing fabric agent, StoreAge Storage Virtualization Manager (SVM), multiMirror, multiCopy and remote mirroring over TCP/IP. The SVS is primarily deployed in active-active pairs with each one connected to its own Fibre Channel (FC) switch or director.

The VMware multipathing fabric agent, part of SVS, resides on the Troika Accelera, not the VMware server. The StoreAge volume management, replication, snapshot, mirroring and data migration tools reside on the SVM appliance. The SVM appliance provides the virtualized LUN map to the fabric agent, which directs each VMware server's access to the physical LUNs. Replication over distance is done using TCP/IP from the SVM appliance.

The company experienced only one significant setback, which occurred when it was implementing the high-availability (HA) option. The administrator pulled a series of FC cables to prompt a failover. Apparently, the sequence of cable pulls revealed a bug in the failover code, which Troika then patched.

Another problem was attributed to operator error. The company had implemented StoreAge server agents on a number of server platforms, including Windows and Novell NetWare, before they were available as "fabric" agents. When new Novell NetWare servers were added, the agents weren't loaded on the servers. This let the Novell servers connect directly to physical LUNs (instead of the virtual LUNs) that they shouldn't have had access to, and data was corrupted. The error was quickly discovered and fixed, although it took much longer to repair the corrupted data. As a result, the company is looking to migrate all of its server agents to fabric-based agents to prevent similar errors. Implementation costs were approximately $120,000 MSRP for the HA configuration; ongoing OpEx is approximately $18,000.

Virtualization appliance eases replication
A manufacturer uses Troika Accelera and SAN Volume Suite for storage virtualization to VMware and remote replication.

The Troika SVS system solves the VMware/Shark dilemma by presenting 1TB virtual LUNs to the VMware servers. The company expects payback in as little as 12 to 18 months based on the savings vs. current costs for storage provisioning and offsite tape storage. The estimated costs don't include the savings from disk-based DR. The Troika SVS will reduce disk-based replication costs from approximately $90,000 per TB to $10,000 per TB by allowing the replicated data to reside on a lower cost disk system such as IBM's DS4000, Nexsan Technologies' ATAboy or EMC's Clariion.
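The per-terabyte replication figures translate into savings as follows. The $90,000 and $10,000 costs come from the text; the 2TB replicated capacity in the example is purely hypothetical, since the article doesn't say how much data the manufacturer replicates.

```python
def replication_savings(replicated_tb: float,
                        old_cost_per_tb: float = 90_000,
                        new_cost_per_tb: float = 10_000) -> float:
    """Dollars saved by replicating to the lower cost disk tier."""
    return replicated_tb * (old_cost_per_tb - new_cost_per_tb)


def percent_reduction(old_cost: float, new_cost: float) -> float:
    """Cost reduction expressed as a percentage of the old cost."""
    return (old_cost - new_cost) / old_cost * 100


print(f"reduction per TB: {percent_reduction(90_000, 10_000):.1f}%")
# Hypothetical capacity, for illustration only.
print(f"savings on 2TB replicated: ${replication_savings(2):,.0f}")
```

Because the savings scale linearly with capacity, the benefit grows with every terabyte added to the replication set.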
