Hindsight and Sacrifice Decisions

A few weeks ago I tweeted the thread below, which discusses sacrifice decisions and contrasts some facets of two cases: Knight Capital (2012) and the NYSE trading halt (2015).

On Aug 1, 2012, a company named Knight Capital experienced a business-destroying incident. Much has been written about it, but that’s not the topic of this thread.

In the aftermath, the amount of hindsight-bias-fueled-armchair-quarterbacking on this event knew no bounds. 1/n

— John Allspaw (@allspaw) February 15, 2019

From HN to blogs to Twitter, the finger pointing on what they “should have” done was rampant. This fervor culminated in, and was validated by, the SEC’s official report on the event, which was effectively a case study on what a post-incident review should *not* look like. 2/n

— John Allspaw (@allspaw) February 15, 2019

(https://t.co/fmVSDCzPXl)

Among the finger-wagging (in the report and greater public reflection) was that they should have halted trade execution as soon as they knew something was amiss. 3/n

— John Allspaw (@allspaw) February 15, 2019

“reckless” “negligent” “naive” – these were the tomatoes lobbed at a group whose business went bankrupt in less than an hour.

Now, fast forward 3 years to July 9, 2015, at the New York Stock Exchange. 4/n

— John Allspaw (@allspaw) February 15, 2019

Once they discovered a potentially significant bug in their systems, they made the decision that the peanut gallery cried Knight Capital should have made: they halted their systems out of an abundance of caution.

The clone army of Captain Hindsights suited up, ready to go. 5/n

— John Allspaw (@allspaw) February 15, 2019

My colleagues and others penned a submission to the @nytimes Op-ed section, reflecting our view that the NYSE’s initiative to make a “sacrifice decision” like that was significant and something to be celebrated. 6/n

— John Allspaw (@allspaw) February 15, 2019

They chose not to run our piece and instead ran “The Bumbling and Irrelevant New York Stock Exchange” (https://t.co/EJEZJprfPz) which said (effectively) that the halting was unnecessary for a “plain-vanilla” glitch. 7/n

— John Allspaw (@allspaw) February 15, 2019

The point of this thread is to bring attention to the notion that our *reactions* to surprising events are the fuel that effectively dictates what we learn from them.

“You’re moving too slow!” “You’re moving too fast!” 8/n

— John Allspaw (@allspaw) February 15, 2019

These admonishments come easily when you know the outcome.

When you don’t, you do *what you think is necessary* to balance multiple conflicting demands, such as time pressure and being thorough and efficient at the same time. 9/n

— John Allspaw (@allspaw) February 15, 2019

How you characterize an event (or the absence of an event!) in hindsight and from the luxury of your distant view may help you feel comfortable with knowing the outcome, but it does nothing to understand or explain the world of those who experienced it at the time. 10/10

— John Allspaw (@allspaw) February 15, 2019

Because some have asked, here was our submission to the NY Times Op-Ed section on the topic, which was not accepted:

To: opinion@nytimes.com

Date: July 9, 2015

The NYSE’s halt to trading for 3 hours on Tuesday, 7 July, is a success story that should be celebrated by everyone.

What? How can a halt to trading of one of the most important exchanges in the world be a success? Their system crashed! What were those guys in the ops room thinking? This is surely a disaster!

Nope. It’s a great success.

Technical systems fail. All technical systems fail. Airline computer systems, newspaper websites, and, yes, financial trading systems. They all fail. We work hard to build defenses into important technical systems. Most of the cost of building and running these technical systems goes into figuring out how things can fail, building in defenses, failovers and redundancies. But no one should imagine that this makes them failure proof. And failures are costly.

Knowing this, good organizations invest in failure. A lot of this investment is mundane. A lot of it is in binders and wikis filled with procedures and phone lists and backup planning. Some of it is in practice, simulation, training. And some of the investment is in guts and fortitude and culture. The goal of this investment is not to prevent failure. The goal is to be ready to stop the damage and recover from failure. And failures are costly.

The 7 July halt is a success.

The NYSE organization did all the right things when the technical system failed:

1. The system failed; it did not crash. A specific part of it stopped working.
2. The management and technical staff recognized the system wasn’t working.
3. They engaged everyone in the organization. Everyone paid attention to what was going on.
4. They deliberately took the system down. They accepted a loss (of profit, face, reputation) because it was most important to them to “get it right” and “do no harm” (in the words of Mr. Farley, the president of the company).
5. They announced that they were taking it down. Mr. Farley went on live television and told people that they had a technical problem.
6. The president of the company took responsibility for the decision to take the system down.
7. The president praised the company technical staff. He did not blame anyone for the event.
8. The ops people diagnosed and fixed the problem in such a way that they could bring the system back up.
9. They brought the system up and closed the day.

We study reactions to failure. The NYSE halt is a textbook case of how to do it. Was there a lot of behind-the-scenes cursing and breaking of china? Sure. So what? The system worked. The people who fixed the problem were the same ones who would normally get blamed for allowing it into the system in the first place. Who better knows what is broken and how to fix it? No one. For 4 hours, those people were the single most valuable resource of the entire company.

This is how you succeed.

Our systems are going to fail. How and why they fail is one problem.

How we respond is another. NYSE should be applauded. They deserve it. We have no doubt that the operations teams were always working to “get it right” during the entire event.

(signed)

John Allspaw
Zoran Perkov
David Woods
Richard Cook

================================================
Affiliations:

John Allspaw is the Senior Vice President of Infrastructure and Operations at Etsy.
Zoran Perkov is the Head of Technology Operations at IEX.
David Woods is Professor of Integrated Systems Engineering at The Ohio State University.
Richard Cook is Professor Emeritus of Healthcare Systems Safety at the Royal Institute of Technology, Stockholm.

The views expressed are those of the individuals and not their employers.

About The Author

John Allspaw