After finding that its SAN-based replication lacked granularity and required too much bandwidth for efficient disaster recovery, HAPO Community Credit Union switched to hypervisor-based replication to bolster the protection of virtual machines.
The 60-year-old credit union has 330 employees and 130,000 customers, and handles $1.2 billion in assets. Headquartered in Richland, Wash., the credit union has 14 branches within 100 miles and a backup data center 40 miles from its headquarters. The sites are connected through 1 Gb fiber links.
Bill Rausch, software engineering manager for HAPO, said the setup allows HAPO to span IP subnets across sites, so it can fail over and back between the primary and backup data centers for any individual machine.
"All we have to do is turn it off here and turn it on in the backup site," he said. "The machine comes up with the same IP and everything is already configured. We don't have to repoint any clients or anything like that. The clients just re-connect and they're up and running."
But failing over individual machines proved difficult as HAPO became heavily virtualized.
HAPO runs its core financial system on an IBM Power Unix server. Backup and DR for that are handled by specialized software, some written in-house by HAPO's IT staff. The rest of the applications run on Microsoft Windows servers, mostly virtualized with VMware.
HAPO has four Dell EqualLogic SANs in its main site and one larger EqualLogic SAN in the secondary site that it replicates to. It also has identical Nimble Storage arrays in the primary and backup sites.
The credit union had set up SAN-to-SAN replication between the primary and backup data centers, using software that came with the SAN arrays, as well as VMware DR software. But Rausch said that was less than ideal for two reasons: The array-based replication was physical-server-oriented, and it used up too much bandwidth.
Rausch said that to fail over virtual machines to the backup data center, he would have to fail over every virtual machine running on the physical server. "So we didn't have the fine-grain implementation that we like," he said.
And he said the replication process was slow, even on gigabit links. "We had to prioritize which virtual machines had more time-critical data," he said. "We spent quite a bit of time building complicated schedules. We had some machines getting backed up every 15 minutes, others maybe an hour behind, yet others might be two or three hours behind, some might be once a day, and others even less frequent than that."
About six months ago, his reseller suggested he take a look at Zerto Virtual Replication (ZVR), which was designed for virtual machines. Rausch said HAPO did a 30-day trial on a few virtual machines and "it didn't take long to decide this is what we need to do."
Rausch said he was sold on Zerto's ability to fail over individual virtual machines. "You pick an individual machine and say, 'Fail over to the backup site,'" he said.
That will shut down the server on the primary site and boot the backup server at the secondary site. "Five minutes goes by and you have completed your failover," Rausch said. "Now that becomes the primary machine and the software is smart enough to shadow the data storage -- in our case the SAN data storage -- from the secondary site back to the primary site for that virtual machine. So when we're ready to fail back, we can use Zerto to shut down the secondary server virtual machine, bring up the primary virtual machine, and we're back in business five minutes later. And all the data has been backed up and transmitted back and forth in the right direction.
"We could do that on our core system years ago with a one- or two-second delay from our primary to backup data center, but we didn't have anything like that for Windows," he said. "Now we've done enough testing to know that our most critical servers tend to be less than 10 seconds behind."
He said ZVR has reduced failover times to minutes or sometimes seconds instead of hours, mainly because it only moves changed data between sites.
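To illustrate why moving only changed data cuts replication traffic, here is a minimal Python sketch of block-level delta replication. It is not Zerto's actual mechanism, which the article does not describe; the function names, the hash comparison and the tiny block size are all assumptions chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to read


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def changed_blocks(old: bytes, new: bytes, size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Return only the blocks of `new` that differ from `old`, keyed by block index."""
    old_hashes = [hashlib.sha256(b).digest() for b in split_blocks(old, size)]
    delta = {}
    for idx, block in enumerate(split_blocks(new, size)):
        is_new_block = idx >= len(old_hashes)
        if is_new_block or hashlib.sha256(block).digest() != old_hashes[idx]:
            delta[idx] = block
    return delta


def apply_delta(replica: bytes, delta: dict[int, bytes], new_len: int,
                size: int = BLOCK_SIZE) -> bytes:
    """Patch the replica with the changed blocks, then trim it to the new length."""
    blocks = split_blocks(replica, size)
    for idx, block in sorted(delta.items()):
        if idx < len(blocks):
            blocks[idx] = block
        else:
            blocks.append(block)
    return b"".join(blocks)[:new_len]


# Hypothetical disk contents at the primary site before and after a write.
primary_old = b"AAAABBBBCCCC"
primary_new = b"AAAAXXXXCCCCDD"

delta = changed_blocks(primary_old, primary_new)
replica = apply_delta(primary_old, delta, len(primary_new))
```

In this sketch only two of the four blocks cross the link, yet the replica ends up identical to the primary copy. A real product would track writes as they happen rather than rescanning and hashing whole volumes.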
Rausch said he uses Zerto to set up protection groups for physical servers that incorporate multiple VMs. "You have maybe an application server, a Web server, a data server on a physical machine, and it doesn't make sense to fail over one of those at a time," he said. "If you're going to move one over, you're going to move the whole stack. So you can take that group of machines, shut them down at the primary site and bring them to life at the secondary site."
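The group-level failover Rausch describes can be modeled roughly as follows. This is a sketch under stated assumptions, not Zerto's API: the `VM` and `ProtectionGroup` classes, the `fail_over` method and the site names are hypothetical and exist only to show the "move the whole stack" idea.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    """Hypothetical model of one virtual machine and where it runs."""
    name: str
    site: str = "primary"
    running: bool = True


@dataclass
class ProtectionGroup:
    """Hypothetical group of VMs that must always move between sites together."""
    name: str
    vms: list[VM] = field(default_factory=list)

    def fail_over(self, target_site: str) -> list[str]:
        """Stop every VM in the group, then start them all at the target site."""
        log = []
        # Shut down the whole stack first so no tier keeps running alone.
        for vm in self.vms:
            vm.running = False
            log.append(f"stop {vm.name} at {vm.site}")
        # Bring the stack up at the target site in the group's listed order.
        for vm in self.vms:
            vm.site = target_site
            vm.running = True
            log.append(f"start {vm.name} at {target_site}")
        return log


stack = ProtectionGroup("web-stack", [VM("app"), VM("web"), VM("db")])
steps = stack.fail_over("secondary")
```

Failing back is the same call with `"primary"` as the target, which mirrors the way Rausch describes using one tool for both directions.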
Rausch said HAPO has "pretty much redundant everything" in the two data centers and runs some systems out of the secondary data center every month on a rotating basis. That allows his team to test the failover for each system over several months.
"We know our disaster recovery works because we test it all the time," Rausch said. "We probably have run about 90% of our virtualized machines out of the backup data center for testing over the last six months. Most machines get tested every quarter."
This may seem like overkill, considering HAPO is in a geographically friendly area. "In all of the places in the entire world, we live in one of the safest from natural disasters," Rausch said. "We don't have to worry about earthquakes, we don't have to worry about tornadoes, we don't worry about hurricanes -- we don't even have winter blizzards."
So why is he so diligent about DR? "We're primarily worried about server failures, but there's always the airplane-crashes-into-the-building kind of thing," he said. "If there's a fire in the building or whatever, we want to be able to move quickly to our backup site."
He said HAPO has mostly moved off tape for data protection, although it does still use CommVault Simpana software to back up to tape for records that must be retained long-term for compliance reasons.