ACL Blog

The Career, Accomplishments, and Impact of Richard I. Cook: A Life in Many Acts

Multiple professional and research communities feel a profound loss at the death of Richard I. Cook. Richard died peacefully at home on August 31, 2022 in the loving care of his wife Karen and his family. Dr. Richard Cook was a polymath who excelled in multiple careers, usually simultaneously. A physician and anesthesiologist, he was […]


What makes public posts about incidents different from analysis write-ups

By John Allspaw

We have written before that documents written about an incident can take many forms and structures, depending on the author(s), purpose, and target audience. The goal of this post is to describe what makes public-facing articles that companies publish about incidents different from internal write-ups representing an effective incident analysis, and a rationale for why


Understanding Incidents: Three Analytical Traps

By John Allspaw

Dr. Johan Bergström, who leads the MSc program in Human Factors and Systems Safety at Lund University (I am an alumnus), has a short ~7 minute video discussing three common analytical traps that incident analysts and accident investigators can get caught in. They are: (1) counterfactual reasoning, (2) normative language, and (3) mechanistic reasoning. Have you seen any


Incident Phenomena: Shorthand Names, à la Danny Ocean

By John Allspaw

These are just a few “Above-The-Line” patterns we’ve observed in cases over time, and potential nicknames for the patterns in the style of how cons are named in movies such as Ocean’s Eleven. Is this shorthand in regular use in our work? Not regularly, but it’s been fun to name them once


Jeli.io: Supporting Grounded Incident Analysis

By John Allspaw

I am an angel investor in Jeli.io, and could not be more excited for the product to come out of “stealth” mode, for many reasons! At its core, Jeli is built around the easy collection and annotation of this data, helping analysts make connections between chat transcripts, interviews, prior related incidents, and a whole host of


Troubled Times: Episode 3

By Dr. Richard I. Cook

Demonstrating adaptability and the need for sustained adaptability. Projections made in previous TT episode: The projections made in Episode 2 are largely confirmed by experience during the summer. Private communications with individuals on these topics reveal substantial levels of stress and stress-related disturbances, notably sleep cycle disturbances, vivid dreaming, varying degrees of anhedonia, strained personal


Presentation: “Findings From The Field”

By John Allspaw

A few weeks ago I gave a talk at the DevOps Enterprise Summit London (which was virtual). The description of the talk: In the past two years, we have had the opportunity to observe and explore the real nitty-gritty of how organizations handle, perceive, value, and treat the incidents that they experience. While the size,


How Learning is Different Than Fixing

By John Allspaw

I was honored to present a talk at the AllTheTalks conference a few weeks back. tl;dr: slides are here, video is below. The topic was incident analysis (big surprise there!) and the notion of learning and fixing, and how these activities are related but not the same. A key idea here is that rather than simply focusing on identifying fixes


Troubled Times, Episode 2

By Dr. Richard I. Cook

Operators and Organizations Coping with Prolonged and Stressful Emergencies. We’ve made some observations over the past month about how tech organizations are adjusting their practices as a result of the COVID-19 pandemic. It’s still too early to synthesize these observations into what we might call “patterns” but some of them do appear to be taking


Incidents: What Is Often Missed & What Can Be Done About That

By John Allspaw

I was invited to give a talk at the Spotify office in New York last month on the topic of learning from incidents. Here is the first slide, which I hope to be the BLUF (“bottom line, up front”) of the talk… Here is the video recording of the talk…


Troubled Times, Episode 1

By Dr. Richard I. Cook

The Criticality of Sustaining the Deployment Pipeline During COVID-19. Many thanks to the members of the Learning From Incidents community who helped review and contribute to drafts of this post! The world is in crisis but I feel like the approach that tech companies around the world are taking to immediately add significant bureaucracy to


Can Resilience Engineering be sufficiently described in 5 minutes?

By John Allspaw

I had to take up the challenge of speaking about Resilience Engineering at an Ignite talk that I gave at the always excellent NYC DevOpsDays. Here’s the abstract for this 5-minute talk: Of course the answer to the question in the title is “no” because this twenty-year-old multidisciplinary field is as broad and


A Resilience Engineer’s Diary on COVID-19

By John Allspaw

During the early weeks of the COVID-19 pandemic, Dr. David Woods wrote diary entries about his observations as the pandemic unfolded.


The Resilience of Bone and Resilience Engineering

By John Allspaw

Last year at the REdeploy conference, my colleague Dr. Richard Cook gave one of the most cogent and clear talks describing resilience and resilience engineering that I’ve come across. I’m embedding the video below, but I’m also including an interactive transcript for those who prefer reading to watching and listening.


“Tricks of the Trade” on ‘how?’ versus ‘why?’

By John Allspaw

A question today on Twitter regarding the distinction between “how” questions and “why?” questions reminded me of another reflection on this topic, in Tricks of the Trade: How to Think About Your Research While You’re Doing It. Howard Becker (University of Chicago) in 1998 wrote on page 58: Ask “How?” Not “Why?” Everyone knows this


The Requirements For Aftermath Projects

By John Allspaw

Given that this is the holiday season in the US and both “Black Friday” and “Cyber Monday” are coming up, we thought we’d give some more detail on what doing an Aftermath project entails. We sincerely hope not to hear from you! What is an Aftermath project? An “Aftermath” project is an immediate investigation and


Markers of Progress in Incident Analysis

By John Allspaw

Quite often, we will talk to new clients about what “learning from incidents” means to them. The descriptions we get in response tend to be important signals for us as we work through whatever project we’re engaged with. Invariably, people will mention signals that learning is taking place in the form of what we might


Some Observations On the Messy Realities of Incident Reviews

By Dr. Richard I. Cook

1) Incident reviews serve multiple purposes. Some of these purposes are overt and explicit but many are not. Some of these purposes cut across others in unproductive ways. Competing agendas reveal the power dynamics present in the organization. A manager who complains that too few action items were produced has revealed his/her real interest: the


The Negotiability of “Severity” Levels

By John Allspaw

What does the term severity mean, in the context of incidents involving software systems? Merriam-Webster gives us this: “the quality or state of being severe: the condition of being very bad, serious, unpleasant, or harsh.” Here are a few colloquial definitions: “Severity measures the effort and expense required by the service provider to manage and resolve an


Hindsight and Sacrifice Decisions

By John Allspaw

A few weeks ago I tweeted this thread which references sacrifice decisions and contrasts some facets of the Knight Capital (2012) case and the NYSE trading halt (2015) case: On Aug 1, 2012, a company named Knight Capital experienced a business-destroying incident. Much has been written about it, but that’s not the topic of this thread.


Human (Cognitive) Work Happens “Above the Line”

By John Allspaw

In 2017 I gave a talk at the DevOps Enterprise Summit and described a sort of frame that my colleagues outlined in the Stella Report. I walked through it bit by bit, and I’ve captured just that 6-minute excerpt in this video here. However – not everyone likes to watch videos of conference talks, so below


Chapter in “Seeking SRE”: SRE Cognitive Work

By John Allspaw

Dr. Richard Cook and I were honored to contribute to a chapter in the new book Seeking SRE edited by the lovely David Blank-Edelman. Consider it a signpost along the way as we learn more about how engineers actually do their work in real-world scenarios (rather than abstract or generalized descriptions of how we like to


REdeploy Conference: Finding Sources of Resilience

By John Allspaw

In August I was honored to speak at the inaugural REdeploy conference centered on the topic of resilience. Here is the abstract for the talk: Sustaining the potential to adapt to unforeseen situations (resilience) is a necessary element in complex systems. One could say that all successful endeavors require this. But resilience is (in many ways)


Incidents As We Imagine Them Versus How They Actually Happen

By John Allspaw

In September I gave a keynote talk at the PagerDuty Summit in San Francisco, and I echoed some of what I covered in my talk at the REdeploy conference. For those who like oversimplifications, here is a bulleted list of “takeaways”: The central premise of the talk is that there exists a gap between how we tend


The Multiple Audiences and Purposes of Post-Incident Reviews

By John Allspaw

The conventional rationale for undertaking some form of post-incident review (regardless of what you call this process) is to “learn from failure.” Stated without further specifics or context, this is, for the most part, a banal platitude aimed at providing at least a bit of comfort that someone is doing something in the wake of these surprising


Moving Past Shallow Incident Data

By John Allspaw

Here is some data: the last incident your company experienced lasted 54 minutes. What insight does this data reveal besides a) an incident happened, and b) it lasted 54 minutes (at least according to someone interpreting an event as an incident)? What else could we glean from this data? Hrm. What if we found that someone marked down


Taking Human Performance Seriously

By John Allspaw

In November last year, I gave a talk at the DevOps Enterprise Summit in San Francisco. A core point of my talk was that it’s time we start taking human performance seriously in the field of software engineering and operations: “The increasing significance of our systems, the increasing potential for economic, political, and human damage
