Published: 10 Nov 2003
One of the main objectives of an organization developing a disaster recovery (DR) plan is to make the plan specific, yet simple enough so that a storage or system administrator who isn't familiar with the organization's day-to-day processes can walk into the recovery site and restore operating system and application data without the organization's IT staff being available for consultation.
That notion was put to an extreme test after Sept. 11, 2001. Consider Cantor Fitzgerald, a financial services provider that lost more than 700 employees in the World Trade Center (WTC) disaster. Many of those employees were IT personnel. For any organization put in that position, the resulting question is: "Can your applications survive this kind of human loss during a disaster?" I recently gained some insight into that question and more during a DR experience with Sungard Recovery Services in Philadelphia on the second anniversary of the WTC disaster.
Sanology Inc. was called by a partner consulting firm two days before the DR test at Sungard to provide a resource for one of their clients and to fill in for a departing consultant who had been working on the project for months. The DR test was going to span the anniversary date of Sept. 11, and I was wondering how this large insurance company would fare in the recovery of its data on such a momentous date, considering that terrorists have a history of reminding their enemies of significant dates with further attacks on the very same day.
Being a curious creature, I took the assignment and discovered that in spite of my concern about the significance of the Sept. 11 date, the Sungard recovery site was not exactly jumping that day with IT organizations exercising their DR plans. This was interesting because we are a computer- and network-dependent nation, and we need fault-tolerant resources to power our applications should an attack on our infrastructure occur on a future Sept. 11 anniversary.
Before arriving at Sungard, I had the opportunity to speak with some of the administrators involved in the DR exercise from the beginning. They provided me with such pertinent information as the recovery objective date, operating system types and levels, as well as the hardware and software configurations of the Legato Systems backup and recovery environment. Armed with nothing more than a diagram of the provided information and a contact name, I showed up ready to go.
The Legato recovery environment had already been set up by the insurance company's core staff and was composed of a StorageTek L700 library using eight 9940 tape drives provisioned over a storage area network (SAN) and discovered by two Sun enterprise-class servers acting as NetWorker servers. StorageTek's tried-and-true ACSLS software allowed the two NetWorker servers to share the robotic arm of the library and permit DMZ and non-DMZ applications to be recovered simultaneously.
After certifying the recovery environment and being introduced to the principals at the insurance company, I started fielding recovery requests without having intimate knowledge of the applications I was recovering.
While troubleshooting tape drive failures, I realized that the core staff didn't know how the tape drives were connected to the Legato servers. Apparently, when communicating their hardware needs to Sungard, the insurance company didn't specify how that connection was to be made (i.e., over a SAN or as direct-attached storage), only that they would need eight 9940 tape drives. But as long as there are no problems with connectivity or the storage device itself, there isn't a real need to know how the storage device is connected to the server, only that the amount, type and connection speed of the requested storage are in fact provisioned.
Sungard's approach to provisioning storage over the SAN for DR exercises could prove to be beneficial to your recovery efforts in the time it takes to ready your environment. Imagine large, multiple SANs composed of servers and storage that's being managed by a drag-and-drop application allowing administrators to simply drag and drop storage resources onto a server icon, while updating zoning and LUN masking information in the background. Automating a process like this could have your production SAN environment duplicated and primed for recovery faster than ever before.
Provisioning IP addresses
As is usually the case, IP connectivity was a problem from the start of the recovery effort. However, after a meeting of minds, these issues were permanently resolved. During problem resolution, I discovered that the insurance company asked Sungard to resurrect its IP infrastructure with the same address nomenclature (e.g., IP addresses and netmasks) as its production LAN. I understand the reasons behind this decision: they didn't want to change their IP nomenclature at the host level, and they wanted to restore their DNS server. Still, resurrecting the production LAN as it existed before the disaster always seems to delay the recovery effort.
A better approach would be to request a bulk number of IP addresses at the necessary link speed; assign those IP addresses to the recovering hosts and then create a master hosts file to be placed on each recovering host.
Although this may require more work, this file can be created as part of a deliverable from your DR provider before you arrive on site. And if you foresee a need for a DNS server for future host additions in a real disaster, there's plenty of free nameserver software that takes a hosts file and creates the DNS database files from its input.
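To make the idea concrete, here is a minimal Python sketch of both steps: handing out addresses from a provider-supplied block to build a master hosts file, and turning that file into DNS A records. The host names, the address block and the domain are hypothetical, and the output is only an approximation of what real nameserver tooling would produce.

```python
import ipaddress


def build_hosts_file(hostnames, cidr_block):
    """Assign one usable address from the block to each hostname, in order."""
    network = ipaddress.ip_network(cidr_block)
    usable = iter(network.hosts())  # skips the network and broadcast addresses
    lines = ["127.0.0.1\tlocalhost"]
    for name in hostnames:
        try:
            address = next(usable)
        except StopIteration:
            raise ValueError(
                f"{cidr_block} is too small for {len(hostnames)} hosts"
            ) from None
        lines.append(f"{address}\t{name}")
    return "\n".join(lines) + "\n"


def hosts_to_a_records(hosts_text, domain):
    """Turn hosts-file lines into zone-file style A records (illustrative only)."""
    records = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            if name == "localhost":
                continue
            records.append(f"{name}.{domain}.\tIN\tA\t{address}")
    return records


if __name__ == "__main__":
    # Hypothetical recovering hosts and a hypothetical block from the DR provider.
    hosts_text = build_hosts_file(["nwserver1", "dbhost1"], "10.20.30.0/29")
    print(hosts_text, end="")
    for record in hosts_to_a_records(hosts_text, "dr.example.com"):
        print(record)
```

The same hosts file would then be copied to every recovering host, so name resolution works before any DNS server is restored.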
In order for that to work, configuration files in your production LAN shouldn't be peppered with IP addresses. Use only the hostnames of servers running application software to make contact with daemon processes running on that server. This way, connectivity remains consistent during the move to the recovery site because all references to network services will be by name and not by a changeable IP address. And if you think about it, isn't it IP connectivity that we want to prove and not necessarily that we can move our network nomenclature?
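One way to check for this ahead of a test is to scan configuration files for literal addresses. The following is a rough Python sketch, not a complete audit tool. The sample configuration text and its key names are invented for illustration, and the pattern will also flag four-part version strings that happen to look like addresses.

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits, not embedded in a longer dotted run.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_hardcoded_ips(config_text):
    """Return (line_number, address) pairs for literal IPv4 addresses."""
    findings = []
    for line_number, line in enumerate(config_text.splitlines(), start=1):
        for match in IPV4_CANDIDATE.finditer(line):
            candidate = match.group(0)
            try:
                ipaddress.IPv4Address(candidate)  # rejects values such as 999.1.1.1
            except ValueError:
                continue
            findings.append((line_number, candidate))
    return findings


if __name__ == "__main__":
    # Hypothetical configuration file contents.
    sample = "server = nwserver1\nbackup_host = 10.1.2.3\n"
    for line_number, address in find_hardcoded_ips(sample):
        print(f"line {line_number}: replace {address} with a hostname")
```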
Floppy disk fallback
The most resource-independent network around is what's generally referred to as "sneaker net." This is the process of moving data from one secured network to another by an administrator physically removing storage media from a computer on one network and walking it over to a computer on a second network. Although this could be a tape or disk device, the usual medium is a floppy disk. And in a DR exercise, the two networks are probably a private connection to your production environment back home and the DR provider's onsite LAN that you have provisioned for the disaster recovery.
Some files were unrecoverable due to tape errors, and the administrators resorted to transferring the files via FTP from their environment back home. However, because the PC that connected them to their network back home didn't have visibility on the LAN at Sungard, they had to use floppies to transport the files to the recovering servers over sneaker net. The problem was that blank floppies weren't readily available because no one foresaw the need to FTP files from their production LAN. So a search party was formed to locate a few floppy disks.
In preparing for a disaster and considering the real possibility of this scenario, removable storage devices and media should be included in your off-site canisters as a fallback. Optimally, this removable storage device should be able to store the largest of your client files, and be discovered by both the computer connecting you back to your production LAN, and a computer on the DR provider's private LAN.
Take nothing for granted
Assume the administrator coming in off the street knows nothing about your backup and recovery processes at home. Therefore, your DR plan should document everything from how to connect your application data to your recovery server--whether it is in the form of a tape library or disk array--to how to perform command-line restores of file systems and databases. To test your plan, have someone from another group who is IT literate, yet unfamiliar with your group's storage practices, execute the plan. This should give you some idea of how an outside consultant would fare in resurrecting your applications.
One last note on this subject is that all procedures should be standardized, documented and well known by the recovery team. The insurance company's resident Legato administrator was well versed in the Legato application. And as a result, it appeared as if he had been given free rein over the environment. This scenario often leads to administrators creating procedures and walking around with them in their heads, and this administrator was no exception. Make your procedures well known with public documentation. Not sharing this information doesn't give you more job security. All it does is require the operation support staff to call you in the middle of the night.
A prioritized application recovery list was nonexistent during this particular exercise. In the absence of this list, application administrators simply fired up recoveries at their whim. Because there weren't a large number of servers to recover, tape drive resources weren't overrun. But when application administrators required my attention to help perform their recoveries, I wasn't able to immediately assist other application owners who, by the attention being given to them by the DR coordinator, owned more important applications. Having a recovery priority list helps shift resources based on a business objective approved by upper management, which leads to a more successful recovery effort.
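As a rough illustration of what such a list buys you, the Python sketch below keeps recovery requests in a priority queue and releases only as many as there are free tape drives. The application names and priority numbers are hypothetical. In practice, the ranking would come from the business objectives approved by management, not from code.

```python
import heapq
from itertools import count


class RecoveryQueue:
    """Hands out recovery requests in priority order (1 = most important)."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker: earlier requests go first

    def submit(self, priority, application):
        heapq.heappush(self._heap, (priority, next(self._arrival), application))

    def next_batch(self, free_drives):
        """Return up to free_drives applications, highest priority first."""
        batch = []
        while self._heap and len(batch) < free_drives:
            _, _, application = heapq.heappop(self._heap)
            batch.append(application)
        return batch


if __name__ == "__main__":
    queue = RecoveryQueue()
    # Hypothetical applications and priorities.
    queue.submit(3, "intranet")
    queue.submit(1, "claims-db")
    queue.submit(2, "email")
    print(queue.next_batch(free_drives=2))
```

With a queue like this, an administrator who asks for help with a low-priority restore simply waits until the higher-priority requests have been handed a drive.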
Testing application recovery on the anniversary dates of historic terror events also helps promote a more successful recovery effort. That's amplified when you consider that Sungard DR policies are based on a first-come, first-served model with regard to the allocation of hardware and floor space. With this policy in mind--which is primarily the same for service providers in this space--how do you know your applications will have a home once a far-reaching disaster strikes?
Even as I am writing this article, Hurricane Isabel is pummeling the East Coast of the U.S. and its projected path is very wide. Should a considerable amount of flooding take place, multiple data centers in Isabel's path could be affected by this storm. And although a natural disaster's target is less symbolic and has more to do with location (e.g., ocean-side states and fault lines), the resulting damage in terms of data loss could prove to be worse than that of a terrorist attack.
Data centers that find themselves in Isabel's path can call their DR provider and put them on alert to the possibility of needing floor space and equipment. However, this doesn't reserve the needed resources for recovery, and thus doesn't invoke any real action on the part of the provider. If you want to ensure that resources are available, you must actually declare a disaster before you get resources for your recovery efforts. This is also the point at which invoices should be generated. Therefore, because disaster declarations aren't cheap, a DR coordinator wouldn't ordinarily issue a declaration "just in case" a disaster might strike their data center.
Declaring a disaster helps to ensure that your resources are available, but doesn't actually guarantee that they are available. Declaring a disaster simply puts you in a queue behind other IT organizations that apparently had Sungard on speed dial and contacted them before your declaration. Otherwise, Sungard would need to have an inordinate amount of computing hardware and floor space at its disposal at all times when you consider the total number of its clients and the resources they would require for a regionwide disaster that stretches as far as Hurricane Isabel.
These realities--the symbolic significance of Sept. 11 and now the weaknesses of our electrical infrastructure--should give us all enough pause to wonder how we would recover our applications if the contracted recovery site was not available due to this first-come, first-served practice. I'm not saying that you should have recovery contracts with multiple recovery vendors (Who can afford that?); but I am saying that your recovery vendor should have resources in sites across the country, and at an extreme, outside of the country to help ensure that the widest reaching disasters do not paralyze your applications indefinitely.
As for the general state of our disaster preparedness, the recovering insurance company completed restoring its applications before its contracted time expired, much to the delight of some of the executives who came to Philadelphia to witness the recovery. Hopefully, that's representative of other companies, as well. But there was room for improvement, such as how they provisioned IP addresses or prioritized application recovery. Practice may not make perfect, but it sure goes a long way toward turning vigilance into readiness.