Published: 17 May 2006
Three data replication choices have emerged: array-based, host-based and network-based, each with distinct pros and cons. Which method is best for your environment?
When it comes to preparing for an eventual disaster, it's better to scatter your eggs rather than fortify your basket. That risk-mitigation philosophy is at the heart of an important storage trend: replicating data to remote locations. And, increasingly, it's not enough for those copies to be offline on tape. Replicated data needs to be readily accessible—online—so that operations can resume as quickly as possible in the event of an outage at the primary site.
"We still do tape backup at night, but we wouldn't want to lose an entire day's worth of data," says Don Moran, database administrator at Hanson Brick & Tile, a brick manufacturer headquartered in Charlotte, NC. The company began replicating the Oracle database that runs its enterprise resource planning system in North Carolina to a Hanson office in Texas approximately two years ago. Disaster recovery (DR) tests have shown Moran that he can get his standby server up and running in less than 15 minutes. "I can go into full DR mode without ever leaving my office," he says.
Storage readers are very keen on replication. According to a survey published in our March issue (see "Snapshot"), the portion of readers doing remote replication has shot up almost 20 percentage points, from 38% to 57%, in the past 18 months. This surge of interest goes hand-in-hand with the growing number of replication products that are on the market. Ten years ago, there were only a handful of replication products, and none of them was cheap. Replication was limited to the domain of those financial firms with the deepest pockets, and the target sites weren't very distant. These days, even small- to medium-sized businesses are replicating select data sets, sometimes thousands of miles away (see "Evaluate your environment").
But even though the economics of replication have changed dramatically, the technology changes have been relatively incremental. In the beginning, replication took place on the host. Large storage vendors eventually began offering replication as a function of their arrays. Today, startups are angling to have more replication take place off the host or array and in the network instead; each approach has its pros and cons.
|Evaluate your environment|
There are many ways to replicate data between geographically dispersed sites. The question is, which one is right for your environment? With so many options, the first step in developing a replication strategy is to evaluate your environment and weigh how well each replication product and approach fits into your storage infrastructure.
Many replication sales are accompanied by professional services engagements with tools to help you evaluate your needs. Steve Higgins, director of business continuity and security solutions at EMC Corp., says such tools measure reads, writes, simultaneous users, latency of the termination equipment and server load to help you size your replication environment properly. "It's like we've taken all the really smart people and put them into this tool," he says.
Symantec Corp.'s disaster recovery planning utility is free and can be downloaded from the Web. Called Veritas Volume Replicator Advisor, it runs on Windows and Solaris, and can collect data generated on systems running Symantec's Veritas Storage Foundation (Volume Manager), as well as process data collected by operating system utilities like AIX iostat and lvmstat, HP-UX and Linux sar, and perfmon under Windows. After collecting data for two weeks, it can help you determine your optimal bandwidth needs and, for future Veritas Volume Replicator customers, help size your Storage Replicator Log (SRL) files.
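To make the sizing exercise concrete, here is a minimal sketch of the kind of arithmetic involved, assuming you already have write-throughput samples taken at a fixed interval (from iostat or sar, for example). The function names, the 20% headroom default and the decimal units are illustrative assumptions; this is not how Veritas Volume Replicator Advisor or any vendor tool actually computes its recommendations.

```python
from typing import Sequence


def required_bandwidth_mbps(write_kb_per_s: Sequence[float],
                            headroom: float = 1.2) -> float:
    """Link speed (megabits/s) needed to keep pace with the peak write rate.

    Assumes decimal units (1 KB = 1,000 bytes) and a hypothetical headroom
    multiplier for protocol overhead and growth.
    """
    peak_kb_per_s = max(write_kb_per_s)
    return peak_kb_per_s * 8 / 1000 * headroom


def worst_case_backlog_mb(write_kb_per_s: Sequence[float],
                          interval_s: float,
                          link_mbps: float) -> float:
    """Largest backlog (MB) that builds up when writes outpace the link.

    A replication log would need at least this much room to ride out the
    sampled workload without overflowing.
    """
    link_kb_per_s = link_mbps * 1000 / 8
    backlog_kb = 0.0
    worst_kb = 0.0
    for rate in write_kb_per_s:
        # Backlog grows when writes exceed the link and drains otherwise.
        backlog_kb = max(0.0, backlog_kb + (rate - link_kb_per_s) * interval_s)
        worst_kb = max(worst_kb, backlog_kb)
    return worst_kb / 1000
```

With per-minute samples of 500, 2,000, 4,000 and 1,000 KB/s, for instance, the peak alone calls for a 32 Mbps link before headroom; on a 16 Mbps link, the same samples would leave a worst-case backlog of about 120 MB for the log to absorb.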
Array-based replication is the most popular way to replicate data, at least for Storage readers. According to the survey mentioned earlier, 50% of respondents said they use array-based products to replicate data remotely. That's because, by and large, array-based replication is:
- Non-disruptive to the application, as the processing is performed by the array
- Application and operating system independent
- Highly scalable
Array-based replication can mirror synchronously, usually over distances of no more than 200 miles, and asynchronously over longer distances. In addition, enterprise-oriented replication products have ways of bridging the gap between synchronous and asynchronous solutions by adding in a third data center. For example, EMC's SRDF/Star can be configured to write to two sites simultaneously: one synchronously and the other asynchronously. If the primary site fails, the customer can quickly resynchronize the SRDF/S and SRDF/A sites by replicating only the differences between the sessions.
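The 200-mile rule of thumb for synchronous mirroring follows largely from propagation delay: each write must wait for an acknowledgment from the remote array. The back-of-the-envelope sketch below assumes light travels through fiber at roughly 200,000 km per second (about two-thirds of its speed in a vacuum) and ignores switch, array and protocol overhead, so real-world numbers will be higher. The function name and the round_trips parameter are illustrative.

```python
KM_PER_MILE = 1.609344
FIBER_KM_PER_S = 200_000.0  # assumption: roughly 2/3 the speed of light in a vacuum


def sync_write_penalty_ms(distance_miles: float, round_trips: int = 1) -> float:
    """Minimum latency (ms) added to each synchronous write by distance alone.

    round_trips is how many times the protocol crosses the link and back
    per write; one is the best case.
    """
    distance_km = distance_miles * KM_PER_MILE
    one_way_s = distance_km / FIBER_KM_PER_S
    return 2 * one_way_s * round_trips * 1000
```

At 200 miles, a single round trip adds roughly 3.2 milliseconds to every write; at 2,000 miles the figure is about 32 milliseconds, which is why longer distances generally call for asynchronous replication.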
In the past couple of years, most midrange arrays have also started to include distance replication functions. For example, EMC Clariion can be outfitted with MirrorView, Hewlett-Packard (HP) Co.'s Enterprise Virtual Array (EVA) models with Continuous Access, Network Appliance (NetApp) Inc.'s filers with SnapMirror, and Sun Microsystems Inc.'s StorEdge 6920 with Data Replicator, to name a few. In many ways, replication is becoming table stakes for any midrange array product.
But some observers worry that some midrange arrays aren't sufficiently equipped to run processing-intensive replication. "Vendors reduce the price by using a lesser microprocessor and less cache," says Arun Taneja, president and founder of the Taneja Group, Hopkinton, MA.
Shortly after Sept. 11, New York City-based Folksamerica Reinsurance Co. implemented a new Fibre Channel (FC) SAN from NetApp. Included in the package were SnapMirror licenses that Folksamerica uses to replicate its 50 or so servers to a Qwest data center in Colorado, where an exact replica of its production environment is stored. The data is replicated asynchronously according to different policies, depending on the importance of the system, explains Erik Tomasi, vice president of infrastructure. That keeps bandwidth costs down and has prevented the filers from suffering from a noticeable performance hit, he says.
If there are any downsides to Folksamerica's DR setup, they have nothing to do with NetApp replication. "The biggest drawback of our current environment is the cost of keeping exact copies of our servers out in Colorado, and the challenge of keeping them synchronized," says Tomasi. "In general, replicating data to Colorado is a great idea," he notes, but in the future, he'll investigate ways to make better use of the equipment sitting in a DR site.
But despite its popularity, array-based replication has its critics. The most oft-cited criticism of array-based replication is that it limits the type of storage you can use. With array-based replication, the arrays at the primary and secondary sites must mimic one another. In other words, if there's an HDS Lightning at your primary site, you'll need an HDS Lightning at the replication site with the same amount of capacity. Not only does that limit your array choices, but it may also force you to buy a higher tier of storage for the DR site than is strictly necessary or affordable.
That was the thinking behind forgoing array-based replication at TradeCard Inc., which runs an outsourced supply-chain automation platform for a network of 1,500 business buyers and sellers. Last year, the company embarked on a plan to implement replication of its production and training databases to a DR site approximately 85 miles away from its main data center in New York City. TradeCard is an EMC Clariion shop, but its decision not to buy MirrorView, which offers both synchronous and asynchronous remote mirroring capabilities, came down to the freedom to buy storage other than EMC Clariion in the future.
"We didn't want to be stuck with that array technology. What if we wanted to add a NAS box down the road?" asks Anthony Ercolino, TradeCard's vice president of data center operations. Financially, says Ercolino, MirrorView was also the most expensive product the company evaluated.
That complaint hasn't exactly fallen on deaf ears among storage vendors, who are increasingly devising novel ways for users to replicate to a less-expensive tier of disk, albeit in the same class of array. For example, Hitachi TrueCopy users still have to replicate between Lightning arrays, but the FC volumes at the primary site can be replicated to serial ATA drives at the DR site. The caveat is that you need to architect your remote site to ensure that it can keep up with the volume of writes being sent from the primary array.
If storage agnosticism is your thing, there's always the other tried-and-true approach: host-based replication. Representatives of this approach include IBM's mainframe-based Extended Remote Copy (XRC), Softek Storage Solutions Corp.'s Replicator, Symantec Corp.'s Veritas Volume Replicator (VVR), XOsoft's WANSync and WANSyncHA, the open-source rsync, and specialized database utilities.
Because host-based replication occurs at the operating system level, it's entirely indifferent to the underlying infrastructure. "We work on any OS, any storage and any network," proclaims Sean Derrington, senior group product manager at Symantec.
That open stance allows for some pretty nifty replication configurations, including one:one, one:many and many:many configurations, for up to 32 different locations, says Derrington. In May, Symantec will announce what it calls a "bunker mirror." Similar to array-based multisite replication like EMC's SRDF/Star, the Symantec product enables replication to any distance, even out of synchronous range. Like array-based multisite replication products, Symantec VVR's new capability works by simultaneously replicating data synchronously and asynchronously to two separate sites. But unlike array-based multisite replication, it won't replicate the entire data set to the synchronous site, only the storage replication log, which comprises approximately 15% of the total replicated capacity. If the primary site goes down, the replication logs from the synchronous site can then be quickly synchronized to the DR site and business can resume.
For a more cost-effective (and some would argue simpler) solution, there are many host-based replication products that function at the file level, such as EMC's RepliStor, NSI Software Inc.'s Double-Take and Symantec's Veritas Replication Exec.
In many ways, host-based replication is array-based replication's polar opposite. Whereas array-based replication functions without any real knowledge of the server or application, host-based replication is intimately aware of the operating system and application particulars. That can be appealing to storage managers who want tight integration with the application—particularly if they're running replication in conjunction with server high-availability software like failover clustering.
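One concrete form that application awareness takes is forcing buffered data out of memory and onto disk, so that whatever gets replicated reflects a known point in time. The sketch below shows the operating-system-level version of that idea in Python, using only standard library calls. The function name and the journal-file scenario are hypothetical; real replication agents typically go further and call application-specific interfaces to quiesce the application first.

```python
import os
from typing import Iterable


def write_with_consistency_point(path: str, records: Iterable[bytes]) -> int:
    """Append records, then force them from memory to stable storage.

    Returns the file size at the consistency point, i.e. the number of
    bytes that a replica taken right afterward could rely on being on disk.
    """
    with open(path, "ab") as f:
        for record in records:
            f.write(record)       # may still sit in the process's own buffer
        f.flush()                 # hand the buffered bytes to the operating system
        os.fsync(f.fileno())      # ask the OS to commit its cache to the device
        return os.fstat(f.fileno()).st_size
```

Without the flush and fsync steps, a block-level copy taken at an arbitrary moment could capture a file that is missing data the application believes it has already written.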
"If you want application consistency, you have to participate with the application APIs to enforce consistency at set points in time," says Anup Tirumala, president and COO at InMage Systems Inc., a Santa Clara, CA, company whose DR-Scout replication and continuous data protection (CDP) suite requires a lightweight host agent. The agent, says Tirumala, integrates with the application API to force data in memory to be flushed and committed to disk. Array-based replication products often offer add-ons for specific applications like Microsoft Exchange or Oracle, "but those require an agent, too, and that detracts from their [agentless] value proposition," says Tirumala.
But the big knock against host-based replication is that it consumes CPU cycles on what are typically mission-critical application servers. "The one thing we really didn't like about [Veritas] Volume Replicator is that when you load software on the host you impact the performance of the system," says TradeCard's Ercolino. Furthermore, Ercolino bristled at the Symantec licensing model, which prices VVR licenses according to the class—the number and type of CPUs—of the system.
If array-based replication is expensive and restrictive, and host-based replication puts a load on your servers and is difficult to administer, what's left? For many startups, and even a few well-established players, the network is the answer.
|The CDP connection|
Several replication vendors now offer optional continuous data protection (CDP). For example, last fall, Kashya Inc. announced a CDP module for its KBX5000 Data Protection Appliance that integrates with Oracle and SQL Server through Microsoft VSS APIs. Similarly, InMage Systems Inc.'s DR-Scout comes with CDP built in, and XOsoft offers Enterprise Rewinder for its WANSync and WANSyncHA products.
At some point in the past couple of years, these replication vendors had an "Aha" moment, says Arun Taneja, founder and president of the Taneja Group, Hopkinton, MA. Like CDP products from vendors such as Mendocino Software and Revivio Inc., they were also trapping every write. "Since they have the foundation, they might as well play in the CDP space," says Taneja.
Coupled with long-distance replication, CDP promises to help users get their systems up—and in a consistent state—faster. "It's all about faster RTO [recovery time objective]," says Taneja.
But to be truly effective, CDP must be tightly integrated with the application. "The lion's share of the engineering work done by the Mendocinos and Revivios is about rolling back, making sure that the application is consistent," says Taneja. Whether or not replication players have engineered in the required intelligence to make sure the rollback brings the application to a consistent state remains to be seen.
During the past couple of years a number of replication products have been released that rely in whole or in part on network-resident resources like dedicated appliances and, in some cases, an intelligent switch. Examples include Kashya Inc.'s KBX5000 Data Protection Appliance (see "The CDP connection"), which can be configured either as an appliance or as a blade in a Cisco Systems Inc. MDS switch; InMage's DR-Scout; and mirroring and replication add-ons for the entire coterie of virtualization platforms, including DataCore Software Corp.'s SANsymphony, FalconStor Software Inc.'s IPStor, Sanrad Inc.'s V-Switch and StoreAge Networking Technologies' Storage Virtualization Manager (SVM). In addition, Hitachi offers a sort of array/network hybrid in its TagmaStore Universal Storage Platform and IBM offers replication running on its SAN Volume Controller (SVC).
In larger heterogeneous environments, network-based replication offers the freedom to use any kind of storage, while the replication process itself runs on a separate network resource away from the application servers. "If you've got hundreds of servers, [managing replication on the server] gets pretty dicey," says Zophar Sante, vice president of marketing development at Sanrad. "Plus, everyone hates agents; they're a pain in the tuchus."
The question of agents is one of the reasons that Kashya, in addition to an appliance-based product, has architected KBX to run agentless on the Cisco MDS 9000 switch and, eventually, on Brocade's SilkWorm Fabric Application Platform. Running replication on the switch simplifies support, says Rick Walsworth, Kashya's vice president of marketing. "If you have hundreds of servers, you don't want to have to install a splitter driver on each one," he says. Furthermore, a lot of companies have strict no-agent policies for their production servers. "The driver does make it a more difficult sell upfront," says Walsworth.
But not always. TradeCard's Ercolino thought briefly about using the Kashya replication product on the Cisco MDS, "but that's an expensive switch," he says. He hasn't seen any performance impact on the system from loading the splitter driver, but he'd consider upgrading to the Cisco switch version if he sees a lag.
Even if some sort of host-resident agent is required, network-based replication succeeds in offloading the bulk of the processing from the host while giving users the freedom to replicate between dissimilar storage.
Take, for example, Santa Clara, CA-based startup Topio Inc. and its Topio Data Protection Suite (TDPS). In a nutshell, the idea is to take the brunt of the processing off the primary system and to perform the "heavy lifting" on a dedicated server at the remote site, explains Chris Hyrne, Topio's vice president of marketing. TDPS requires an agent on the host at the primary site, but, Hyrne insists, it's an extremely lightweight agent. Specifically, the agent's job consists of intercepting writes, time-stamping them and sending them over the wire to the recovery server, where the writes are reassembled in order.
That architecture works well when replicating applications that consist of multiple servers, for example, an Exchange server plus its associated domain controller, says Hyrne. To recover the application, "you need to recover both of those servers to the exact same consistency point," he says. Hanson Brick & Tile's Moran is currently using TDPS to replicate a single Oracle database, but is considering extending it to replicate Exchange and some software development servers.
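The intercept, time-stamp and reassemble pattern described above can be sketched in a few lines. This is an illustration of the general technique only, not Topio's implementation: the Write record, the function names and the in-memory "volumes" are hypothetical, and the sketch assumes that the hosts' clocks are synchronized and that each host's stream arrives already in timestamp order.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass(order=True)
class Write:
    timestamp: float                      # the only field used for ordering
    host: str = field(compare=False)
    offset: int = field(compare=False)
    data: bytes = field(compare=False)


def reassemble(streams: Iterable[Iterable[Write]]) -> List[Write]:
    """Merge per-host write streams into one globally ordered stream."""
    return list(heapq.merge(*streams))


def apply_until(writes: Iterable[Write], consistency_point: float,
                volumes: Dict[str, bytearray]) -> Dict[str, bytearray]:
    """Replay ordered writes up to and including a shared consistency point."""
    for w in writes:
        if w.timestamp > consistency_point:
            break                         # later writes are left unapplied
        vol = volumes.setdefault(w.host, bytearray())
        end = w.offset + len(w.data)
        if len(vol) < end:
            vol.extend(b"\0" * (end - len(vol)))   # grow the toy volume
        vol[w.offset:end] = w.data
    return volumes
```

Because every server's writes are replayed against the same cutoff, an Exchange server and its domain controller, to use the example above, end up at the same point in time on the recovery side.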
With the additional processing power that comes from inserting a device in the network, replication startups have also experimented with including features that might otherwise consume too much CPU in an array- or host-based configuration. For example, Kashya uses storage and bandwidth-reduction technologies that can cut down the amount of data that travels over the wire, which keeps telecommunications costs down.
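As a rough illustration of how bandwidth reduction can work in an asynchronous setup, the sketch below combines two generic techniques: folding repeated writes to the same block into one, and compressing the result before it crosses the WAN. It is not a description of Kashya's technology; the function names and the simple wire format (an 8-byte block number, a 4-byte length, then the data) are invented for the example.

```python
import zlib
from typing import Iterable, List, Tuple

BlockWrite = Tuple[int, bytes]  # (block number, data written to that block)


def fold_writes(writes: Iterable[BlockWrite]) -> List[BlockWrite]:
    """Keep only the most recent write to each block in this batch.

    Folding discards the order of writes inside the batch, so the remote
    side must apply the whole batch atomically to stay consistent.
    """
    latest = {}
    for block, data in writes:
        latest[block] = data              # later writes overwrite earlier ones
    return sorted(latest.items())


def pack_for_wan(writes: Iterable[BlockWrite]) -> bytes:
    """Fold a batch of writes, serialize it and compress it for transmission."""
    payload = b"".join(
        block.to_bytes(8, "big") + len(data).to_bytes(4, "big") + data
        for block, data in fold_writes(writes)
    )
    return zlib.compress(payload)
```

How much this saves depends entirely on the workload: a database that rewrites the same hot blocks many times between replication cycles benefits far more from folding than a stream of unique, already-compressed files.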
InMage's Tirumala says another benefit of having a network-based appliance doing the replication is to insulate replication from WAN outages. "If you're taking a traditional host-based approach, the deltas are buffered by the host; if there's a traffic issue, that can cause problems," he says. "With us, our appliance is doing the buffering outside of the production environment."
Running replication from the network has a lot of advantages, but market forces may slow down the movement to bring more intelligence to the network. "I firmly believe that, ultimately, replication will be done primarily from the network—that is where it makes the most sense," says Taneja of the Taneja Group. But entrenched array and host vendors' existing replication businesses are "too large" and the margins "too juicy" for them to actively push alternative approaches. The transition, says Taneja, "is going to take a lot longer than logic would dictate."