San Francisco-based ezRez Software, an online travel Software-as-a-Service (SaaS) vendor, runs SIOS Technology Corp.'s SteelEye Protection Suite (SPS) for Linux for high availability failover of its PostgreSQL-based platform. According to Michael Bruttig, senior director of technical operations with ezRez, the suite's LifeKeeper software caught an issue that could have resulted in a global outage.
The ezRez platform allows clients to launch online travel services, such as airline, car and hotel reservations, with the look and feel of a customized implementation. Its clients include AirAsia, American Airlines, American Express, JetBlue, Intercontinental Hotels, LAN Airlines, Starwood Hotels and Resorts and United Airlines.
ezRez's data center is set up in a colocation facility operated by Savvis Inc. The environment is made up of six primary database servers and a dedicated high availability failover server (Dell PowerEdge R610 and C1100), attached to an EMC CLARiiON CX4 array over Fibre Channel. "The LifeKeeper software monitors the LUN assignments so if one server goes down, it moves [that server's workload] over to the failover box," Bruttig said.
"Our data center is on the East Coast, so we rely on Savvis' remote hands service a lot," said Bruttig. "Our cage, at that time, wasn't well labeled, and remote hands moved the wrong server because it was mislabeled."
ezRez maintains separate databases for each of its clients; however, one server contains what ezRez calls a "shared DB," a resource that all of its clients use. "The [shared] box was unplugged, it automatically failed over, we got a notification from LifeKeeper, but no client calls and no web metrics alerts," said Bruttig. "Our clients monitor us to the minute," he said. "It was probably a 30-second failover. Without the LifeKeeper software, we would have to get into the SAN, move the LUN over, bring up the LUN on the standby box, check the database for consistency, and change the IP on the box." He estimated that a manual recovery would have resulted in multiple hours of downtime, but could not comment on what that would mean financially for the company.
ezRez has SPS tuned to send SMS and email notifications when a failover happens. "We might actually have it tuned too far," Bruttig said. "We get notifications about connections between all the servers, LUNs, all that good stuff."
After the failover, ezRez ran operations on the failover box until the next scheduled maintenance window, at which point it moved the workload back to the primary server. "There's two ways you can perform the failback [in SPS]," said Bruttig. "There is a command line interface; that's probably the quickest way, because we are PCI compliant and have a lot of restrictions on getting into our production environment. But there is also a really good GUI that's easy to use to move around resources."
As noted above, ezRez's platform is based on the PostgreSQL open-source database management system. SPS' ability to easily integrate with the company’s PostgreSQL-based platform was a major factor in choosing the product. Bruttig also had previous experience using SPS in a position at another organization.
"It didn’t require any file system modification, there were no special scripts to run," said Bruttig. "[Symantec] Veritas [Cluster Server] requires you to switch your file system to their proprietary file system."
SPS is available on either a perpetual license or a subscription basis. Pricing is based on the number of nodes in a cluster. Each node requires a license, and there is an additional 20% fee per node for support.