This recent outage really highlights the importance of testing. How often should power systems be tested? Who should be involved in those tests?
Testing should be done at least monthly, and after any significant electrical event. Be sure to add surge arresters to copper network and telco cabling as well as to wall power. All of these can be conduits for an electrical surge.
Were you involved in any "black-out" situations during hurricanes? If so, what were the first steps you took to restore your systems? What types of precautionary steps did you have in place?
Power outages always accompany weather events. We have gotten used to them here in Florida. The fact is, though, that a hurricane is a phased disaster scenario. That is to say, there is a lot of advance warning that the storm may come your way. So you can take some proactive measures and power everything down in plenty of time to comply with an evacuation order.
A few years ago, we did have problems during an abnormally cold winter. Power was traded from our generating facilities to northern grids that needed it for heating. The result was rolling outages on Christmas Day. But companies had gone to some effort as part of hurricane planning to work with their utility service providers and ensure that redundant switching equipment was used to supply them. When one set of gear was taken out of service as part of the rolling outage, they were still up and running on a different set of gear. It just takes preplanning.
After events like the rolling blackouts in California, do you think most of the companies in the affected regions were prepared for this?
Nope. Like anything else, people tend to put the past behind them. Protection costs money, and CFOs these days often have a hard time justifying expenditures for capabilities that, in the best of circumstances, would never need to be used. Moreover, distributed computing has increased the number of assets vulnerable to power problems: things aren't collected inside the glass house of the data center anymore. So most companies are a target-rich environment for power-related events. Protecting a lot of distributed things is more expensive than protecting a lot of centralized things, and money is in short supply in most companies today.
What should admins do as a post-mortem to this event?
Look hard at this event from the standpoint of distances between your primary and backup facility or hot site. The Fed wimped out on specifying a minimum acceptable distance between facilities after 9/11, but that doesn't mean that companies can't take the initiative themselves. Your backup site should not be served by the same power supplier or delivery infrastructure as your primary. It's as simple as that.
Secondly, look seriously at your internal power generation and power protection capabilities. Spending a few bucks now can keep equipment in service longer and possibly cut down on transient-related software and hardware errors.
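The first check above, separation between the primary and backup sites, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Site fields, the utility and grid labels, and the minimum distance are all hypothetical placeholders, since no minimum distance has been officially specified. Each company would substitute its own inventory data and its own threshold.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class Site:
    name: str
    lat: float          # latitude in decimal degrees
    lon: float          # longitude in decimal degrees
    utility: str        # hypothetical label for the power supplier
    grid_region: str    # hypothetical label for the delivery infrastructure


def distance_km(a: Site, b: Site) -> float:
    """Great-circle distance between two sites (haversine formula)."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def separation_issues(primary: Site, backup: Site, min_km: float) -> list:
    """Return a list of reasons the backup site is not independent enough.

    min_km is a company-chosen threshold (an assumption, not a standard).
    An empty list means no issue was found by these three checks.
    """
    issues = []
    if primary.utility == backup.utility:
        issues.append("same power supplier")
    if primary.grid_region == backup.grid_region:
        issues.append("same delivery infrastructure")
    d = distance_km(primary, backup)
    if d < min_km:
        issues.append(f"only {d:.0f} km apart (minimum {min_km:.0f} km)")
    return issues
```

Calling separation_issues(primary, backup, min_km=...) with your own site records returns an empty list when the two sites differ in supplier and delivery infrastructure and sit at least the chosen distance apart. Otherwise it lists what to fix.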