Published: 03 Jun 2002
It was shortly after the World Trade Center towers collapsed that I stood in the lobby of an undisclosed location peering across the Hudson River at what used to be the World Trade Center complex. In its place were blaring lights and huge tractor trailers that were aiding in the search and recovery effort for victims. I was part of a different type of search and recovery effort being led by Legato Professional Services. Our mission was to establish a backup and recovery infrastructure that would allow one of the world's largest brokerage firms to find and recover its data while continuing to meet SEC regulations regarding data protection. During this engagement, I witnessed the trials and tribulations this firm experienced during what must have been one of the most strenuous times in U.S. history. Here's my report.
The firm's data center was located in the WTC complex, and its local computers and storage were destroyed by the dust and debris that fell around the WTC towers. The collapse destroyed any live data, and because the off-site vendor hadn't yet arrived that morning, the previous night's backup tapes were still in the building, which was now considered a crime scene.
For those applications that were deemed mission-critical before the attack, data had been mirrored using a Hitachi SAN. Thus, those applications were up and running within hours of the collapse. For the many other applications, however, we had the difficult task of making the necessary data available to each business unit looking to restore its data.
Initially, we needed to get duplicate hardware and software to reconstruct the destroyed production environment so recovery could begin. The supporting vendors (Legato, Hitachi, StorageTek, and Sun) were all great in providing the necessary pieces of equipment. After procuring the hardware to rebuild the backup and recovery infrastructure, the process began by recovering the firm's six Legato NetWorker servers: four for recoveries and two for continued backups. These servers were Sun Enterprise 6500s with 8GB of memory, four CPUs and direct-attached StorageTek L700 libraries using DLT tapes.
Because the firm used DLT drives, we experienced severe performance problems loading and unloading the hundreds of tapes used during the recovery. DLT drives are great once the tape has been loaded into the drive, but they cause problems when an unusually large number of tape mounts is required, as in an enterprise-wide disaster. This problem was exacerbated by code in the binary command responsible for loading and unloading the tapes, which needed to run atomically. As a result, the multiple requests for loads and unloads during the many recoveries caused several delays. These delays aren't exclusive to DLT tape drives: any tape drive that is designed for increased capacity instead of speed would yield the same dismal results.
Additional problems rapidly surfaced: In our recovery operation, the Legato NetWorker servers weren't only responsible for mounting tapes and updating the indexes, they were also responsible for moving data between the tape drive and the recovering client system. Although it should be understood that a disaster of this magnitude couldn't have been imagined, and getting the recovery systems up and running as soon as possible was at the forefront of everyone's mind, the deployed design negatively impacted the established service level agreements between the recovery team and the business units that they supported.
A better solution would have been to station storage nodes between the recovering clients and the NetWorker server. In such a configuration, the storage nodes would have been responsible for the movement of the data, freeing the NetWorker server of all of the hardware interrupts associated with opening and closing the tape drive and the network interface. Thus, the loading and unloading of tapes would have proceeded more smoothly.
Nonetheless, recovery of the six Legato NetWorker servers was completed without incident. Each took approximately five to six hours once the hardware was set up properly.
Lack of consistent IP connectivity
Most of the requested recoveries were completed without incident. Many of the ones that did fail, however, failed because of the lack of consistent IP connectivity and name resolution. For one reason or another, client systems were randomly dropping off of the network. Luckily, the firm's day-to-day management practices included a set of nicely written scripts to test the functionality of the managed client before executing a backup. These scripts allowed us to determine the likelihood of a successful backup before the client was submitted, as well as preserve the Legato server's resources for quality work. The stability of the IP network and name resolution is of the utmost importance to a backup and recovery application. The firm's applications were designed to totally depend on the operating system and supporting environment for forward and reverse lookups of client machines. Lesson learned: It's important to have these supporting services worked out in advance as part of a business contingency plan.
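The kind of pre-backup check described above can be sketched in a few lines of Python. This is only an illustration, not the firm's scripts: the function name `check_name_resolution` and its injectable `forward` and `reverse` parameters are assumptions made for the example. It performs a forward lookup of the client name, then a reverse lookup of the resulting address, and reports whether the two agree.

```python
import socket


def check_name_resolution(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return (ok, reason) for a forward and reverse lookup of hostname.

    forward and reverse default to the system resolver; they are
    parameters only so the check can be exercised without a network.
    """
    try:
        address = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup failed: {exc}"

    try:
        reverse_name, aliases, _ = reverse(address)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup of {address} failed: {exc}"

    # Accept a match on the full name or on the short (unqualified) name.
    known = {reverse_name.lower(), *(alias.lower() for alias in aliases)}
    short = hostname.lower().split(".")[0]
    if hostname.lower() in known or short in {n.split(".")[0] for n in known}:
        return True, f"{hostname} <-> {address}"
    return False, (f"reverse lookup of {address} returned "
                   f"{reverse_name}, not {hostname}")
```

A scheduler could call this for each client and skip, or flag for follow-up, any client whose lookups fail, rather than spending server resources on a backup or recovery that is unlikely to succeed.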
Chaos reigned during the weeks immediately following the attacks. Day-to-day business practices and common sense seemed to be absent at times. Initially, we were all in reactive mode. That's the very reason why a business contingency plan is helpful after a disaster. Had a thorough business impact analysis been performed, the firm would not have faced confusion and unwanted results when it started to restore applications. Once the recovery servers were set up, all of the business units started yelling, "Me first." The firm's recovery team could have served its clients better had upper management signed off on an application recovery priority list. Usually, the applications that generate the most revenue or are used the most top the list.
While application data was being restored, updated data had to be backed up. The fundamental submittal mechanism in Legato NetWorker is a group. The group will contain one or more Legato clients that have similar characteristics (e.g., same availability requirements). However, there is a limit on the number of clients that should be in any one group, and the number of groups that should be active at any one time. This limit was directly related to the number and performance of the tape drives configured in each StorageTek L700 Library, and the capabilities of the hardware supporting the Legato NetWorker server. Not being cognizant of these limitations led storage administrators to schedule more clients for concurrent backups than could be supported by the hardware. As a result, many backups failed because they timed out.
The best approach to calculate your backup and recovery infrastructure's thresholds is to determine the number of streams (file systems or volumes) necessary to keep the tape drive streaming at its maximum transfer rate, and then multiply that number by the number of tape drives supporting your environment. The result will give you the total number of streams that can be supported at any one time by your hardware. And depending on the number of streams each backup client is configured to throw at the backup server, the result will also give you the total number of clients that can be active at any one time.
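The arithmetic above is simple enough to write down as a small Python sketch. The function names and the sample figures (four streams per drive, ten drives, four streams per client) are illustrative assumptions, not measurements from the firm's environment.

```python
def max_concurrent_streams(streams_per_drive, drive_count):
    """Total streams the tape drives can absorb at once.

    streams_per_drive is the number of streams needed to keep one
    drive streaming at its maximum transfer rate.
    """
    if streams_per_drive < 1 or drive_count < 0:
        raise ValueError("streams_per_drive must be >= 1 and drive_count >= 0")
    return streams_per_drive * drive_count


def max_concurrent_clients(total_streams, streams_per_client):
    """Clients that can be active at once, rounded down."""
    if streams_per_client < 1:
        raise ValueError("streams_per_client must be >= 1")
    return total_streams // streams_per_client


# Hypothetical example: 10 drives that each need 4 streams to keep streaming,
# and clients configured to send 4 streams each.
total = max_concurrent_streams(streams_per_drive=4, drive_count=10)
clients = max_concurrent_clients(total, streams_per_client=4)
print(total, clients)
```

Scheduling more active clients than the second figure means some streams wait for a drive, which is the condition that led to the timeouts described earlier.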
With chaos prevailing, it was more important than ever to coordinate information regarding the status of recoveries and subsequent backups. A Web-based reporting tool would have spared the recovery team from fielding user queries directly. Those queries often interrupted critical work and slowed down our main task of recovering the brokerage firm's data.
Another area of concern was the projects that fell through the cracks due to shift turnover. Client backups would fail for one reason or another, and then be restarted without any corrective measures implemented in the interim. And of course, the client backup would fail again for the same reason. Not only did this action delay the successful backup of the client, it also consumed valuable resources from the Legato NetWorker Server with no chance of success. And in a few circumstances, this action even caused the failure of active backups by consuming resources.
In order to recover the backup and recovery server, the recovery team needed to know the volume number and IDs of the tapes containing the server's bootstrap information. Not only was the firm smart enough to send a copy of this report off-site with their tapes, they also e-mailed these reports to an ISP, which gave them access to this critical information from anywhere in the world, and before the paper report arrived with the tapes. This practice gave the firm the ability to prep the recovery environment for the return of the bootstrap tapes, and prepare some sort of written script about what to do when the tapes did arrive.
The firm faced many obstacles in obtaining the necessary tapes for client restores because the area surrounding the data center was considered a crime scene. As a result, special permission was necessary before the firm's tape operators were allowed to enter the building to retrieve the most recent tapes.
However, because the firm used an enterprise-class library, once a tape was loaded into the library, the tape operator didn't have to retrieve and mount the tape every time it was requested by the Legato NetWorker server. To further reduce or eliminate the initial time it took the tape operator to load a tape, it would have been beneficial to the firm if they had a predetermined client priority list in place that also indicated the recovery objective date. The recovery team could then have identified the exact tapes necessary to restore a particular client to its recovery objective date, before the disaster ever happened. Then, arriving on site, operators would know which tapes to preload into the tape library.
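One way to picture the preload list described above is the following Python sketch. Everything in it is hypothetical: the function name, the client names, the volume labels and the slot count are invented for illustration, and the mapping of clients to tapes would in practice have to come from the backup server's own records for each client's recovery objective date.

```python
def tapes_to_preload(priority_list, tapes_by_client, free_slots):
    """Return the volumes to load first, in recovery-priority order.

    priority_list   -- client names, most important first
    tapes_by_client -- client name -> volumes needed to restore that
                       client to its recovery objective date
    free_slots      -- number of library slots available for preloading
    """
    preload = []
    seen = set()
    for client in priority_list:
        for volume in tapes_by_client.get(client, []):
            if volume in seen:
                continue  # tape already queued for a higher-priority client
            if len(preload) >= free_slots:
                return preload  # library is full; the rest must wait
            preload.append(volume)
            seen.add(volume)
    return preload


# Hypothetical data for illustration only.
priority = ["trading", "settlement", "email"]
needed = {
    "trading": ["A001", "A002"],
    "settlement": ["A002", "B010"],
    "email": ["C100"],
}
print(tapes_to_preload(priority, needed, free_slots=3))
```

Because a tape shared by two clients is listed only once, the library slots go to the highest-priority clients first without loading the same volume twice.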
Although it appears as if the firm's efforts to manage the recovery of its application data fell short of any real success, the reality is that given the nature of the disaster, and the many facets of our society that were altered due to the scope of the disaster, they fared well in spite of the chaos that surrounded them.
As tragic as the WTC and Pentagon disasters were, America should consider the events of Sept. 11 a wake-up call to our nation. No longer can we rely on the low odds of a disaster happening. We need to fortify our information infrastructures with solutions that will make such a disaster far less far-reaching than the one we experienced in September.