How resilient are your operations?

A Bloomberg article about the impact of system failures at Delta Air Lines earlier this week carried the headline “Delta System Failure Marks Wake-Up Call for Airline Industry”.

In fact, I suspect Boards of Directors in every industry, not just the airline industry, are asking their CEOs for an assessment of the risk of their systems having a similar system-wide failure.

The Wall Street Journal is reporting that Delta’s CEO, Ed Bastian, is taking full responsibility for the outage:

Over the past three years, the nation’s No. 2 airline by traffic has spent “hundreds of millions” in upgrades and systems, including $150 million this year alone. Delta earlier this year named a new chief information officer and has brought in new leaders for its information technology and infrastructure team.

“It’s not clear the priorities in our investment have been in the right place,” Mr. Bastian said. “It has caused us to ask a lot of questions which candidly we don’t have a lot of answers for.”

Years ago, I spent a fair bit of time with an external auditor who wanted to understand more about how our network was configured. We talked about risks and ways to mitigate those risks. When negotiating fibre swaps, we looked at detailed maps to ensure that we were really getting improved physical diversity, not sharing the same railroad tracks, bridges, etc. As a result, when a fibre cut or power outage causes a physical failure, there is often a backup to restore service; at worst, the failure will generally result in a limited, somewhat localized outage.

Software changes often present the most substantial risk, because updates are rolled out system-wide over a short period of time. How often have we seen outages caused by software faults that went undetected in the lab and did not surface until the system was subjected to peak traffic loads?

It isn’t enough to spend money on system resilience. Delta shows that money needs to be spent in the right places.

As more devices get connected in the Internet of Things, and with autonomous drones and cars, companies need to take a fresh look at system resilience, understanding the risk of failures and the costs that can arise.

Not every system has to be up all the time. But does your Board understand the cost of failure?
