Adaptive Capacity Labs

Hindsight and Sacrifice Decisions

A few weeks ago I tweeted a thread that references sacrifice decisions and contrasts some facets of the Knight Capital (2012) case and the NYSE trading halt (2015) case.

Because some have asked, here was our submission to the NY Times Op-Ed section on the topic, which was not accepted:

To: opinion@nytimes.com

Date: July 9, 2015

The NYSE’s halt to trading for 3 hours on Tuesday, 7 July, is a success story that should be celebrated by everyone.

What? How can a halt to trading of one of the most important exchanges in the world be a success? Their system crashed! What were those guys in the ops room thinking? This is surely a disaster!

Nope. It’s a great success.

Technical systems fail. All technical systems fail. Airline computer systems, newspaper websites, and, yes, financial trading systems. They all fail. We work hard to build defenses into important technical systems. Most of the cost of building and running these technical systems goes into figuring out how things can fail, building in defenses, failovers and redundancies. But no one should imagine that this makes them failure proof. And failures are costly.

Knowing this, good organizations invest in failure. A lot of this investment is mundane. A lot of it is in binders and wikis filled with procedures and phone lists and backup planning. Some of it is in practice, simulation, training. And some of the investment is in guts and fortitude and culture. The goal of this investment is not to prevent failure. The goal is to be ready to stop the damage and recover from failure. And failures are costly.

The 7 July halt is a success.

The NYSE organization did all the right things when the technical system failed:

1. The system failed; it did not crash. A specific part of it stopped working.
2. The management and technical staff recognized the system wasn’t working.
3. They engaged everyone in the organization. Everyone paid attention to what was going on.
4. They deliberately took the system down. They accepted a loss (of profit, face, reputation) because it was most important to them to “get it right” and “do no harm” (the words of Mr. Farley, the president of the company).
5. They announced that they were taking it down. Mr. Farley went on live television and told people that they had a technical problem.
6. The president of the company took responsibility for the decision to take the system down.
7. The president praised the company technical staff. He did not blame anyone for the event.
8. The ops people diagnosed and fixed the problem in such a way that they could bring the system back up.
9. They brought the system up and closed the day.

We study reactions to failure. The NYSE halt is a textbook case of how to do it. Was there a lot of behind-the-scenes cursing and breaking of china? Sure. So what? The system worked. The people who fixed the problem were the same ones who would normally get blamed for allowing it into the system in the first place. Who better knows what is broken and how to fix it? No one. For 4 hours, those people were the single most valuable resource of the entire company.

This is how you succeed.

Our systems are going to fail. How and why they fail is one problem.

How we respond is another. NYSE should be applauded. They deserve it. We have no doubt that the operations teams were always working to “get it right” during the entire event.

(signed)

John Allspaw
Zoran Perkov
David Woods
Richard Cook

================================================
Affiliations:

John Allspaw is the Senior Vice President of Infrastructure and Operations at Etsy.
Zoran Perkov is the Head of Technology Operations at IEX.
David Woods is Professor of Integrated Systems Engineering at The Ohio State University.
Richard Cook is Professor Emeritus of Healthcare Systems Safety at the Royal Institute of Technology, Stockholm.

The views expressed are those of the individuals and not their employers.