Q&A: What went wrong (and right) during the online outage

Six minutes after midnight on Jan. 10, in the Administrative Services Building (ASB) second-floor data center, a major storage device for Iowa State's servers failed. On its way down, the device sent an electronic SOS to its creator, EMC Corp. About the same time EMC was hearing from the device, Iowa State techs began noticing problems on university sites. It was the start of a 20-hour outage that would take down many sites and services and make for a busy Thursday as information technology services staff and others scrambled to get things back online.

Michael Lohrbach, senior systems analyst in ITS, explains what went wrong (and right) during the outage.

What happened to the storage device?

The storage device has built-in redundancy to prevent failures like this. Unfortunately, both the primary and backup components failed at the same time. That's very rare. Our vendor (EMC) believes there was a bug in the device's software that caused the failure, and a software upgrade is on the way.

When the storage device went down, it knocked out about half of Iowa State's VMware environment. VMware is virtualization technology that allows dozens of virtual servers to run on a single physical server, which makes for cost-effective, efficient handling. In addition to knocking out many servers, the failing device unfortunately caused otherwise functioning machines to lock up. The lockups prevented ITS staff from immediately restoring downed servers with backup data.

What sites and services went down in the outage?

Unavailable sites included the ISU homepage, many college and departmental sites, AccessPlus and Blackboard Learn. Microsoft Exchange was running, but only on half the servers it normally uses, causing some delays in message delivery and intermittent connectivity issues.

Were backups available to restore sites?

We have storage and backup devices in both ASB and the Durham Center, and much of the data is replicated between the centers. However, many systems have dependencies, and how well each one keeps working varies with the clustering or failover method set up for the service. As we proceeded, we had to decide whether to restore a service from backup or work to bring it back online. We ended up using both methods in an effort to restore service as quickly as possible.

Who was working on the outage?

We had about a dozen people working on the issue in the Durham data center. Beyond that room, information technology services staff and college and departmental IT staff elsewhere on campus were working on related tasks. We had a lot of good communication and support from IT units in the colleges and departments, whose staff were very understanding of how these issues happen and willing to help out in any way they could. Four to five EMC engineers were on the case, including Des Moines-area representatives as well as specialists around the world, who were on the phone with us from about 3 a.m. to 7 p.m. Thursday. It was all hands on deck.

With their services and sites out of commission, some departments turned to Facebook, Twitter and low-tech web pages to get information out. How did that work?

We heard comments from people who said they were watching these sites for updates. That was very useful, as was the feedback relayed through social media sites. We in ITS were focusing on getting things back online, rather than testing systems. It was good to have administrators of other sites and general users keeping us informed about what was and wasn't working.

When was the outage over?

Most sites and services were back online by 6:30 p.m. Thursday. After that, we spent several more hours cleaning up. When something like this happens, systems don't have a chance to shut down gracefully, and that can cause issues. So we rebooted systems and took other actions to get them back into a healthy state. By 10 p.m., almost everything was back to normal.

What did the outage cost Iowa State?

Staff working on the outage are salaried P&S staff, and we have full maintenance and support contracts with our vendors. So staffing and repair costs on the failed storage device are covered. It may be hard to tally, but the biggest cost was likely incurred in departments that lost functionality for a day. Thankfully, email and file services were functional. Despite the hardships throughout campus, we had a lot of support from the Iowa State community during the outage. We are very appreciative and lucky to have such a good relationship with the IT units and departments across campus and with the people using our services.

What's next?

We're turning our attention to how we can prevent this from happening again and how we can recover more quickly when something goes wrong. We in ITS take our service -- our reputation, security and data protection -- very seriously. We'll continue to work to make that service as good and reliable as it can be.