Adaptive Capacity Labs

What makes public posts about incidents different from analysis write-ups

We have written before that documents written about an incident can take many forms and structures, depending on the author(s), purpose, and target audience. The goal of this post is to describe what makes public-facing articles that companies publish about incidents different from internal write-ups representing an effective incident analysis, and a rationale for why

Understanding Incidents: Three Analytical Traps

Dr. Johan Bergström, who leads the MSc program in Human Factors and Systems Safety at Lund University (I am an alumnus), has a short ~7 minute video discussing three common analytical traps that incident analysts and accident investigators can get caught in. They are:

1. Counterfactual reasoning
2. Normative language
3. Mechanistic reasoning

Have you seen any

Incident Phenomena: Shorthand Names, à la Danny Ocean

These are just a few “Above-The-Line” patterns we’ve observed in cases over time, and potential nicknames for the patterns in the style of how cons are named in movies such as Ocean’s Eleven. Is this shorthand in regular use in our work? Not regularly, but it’s been fun to name them once

Jeli.io: Supporting Grounded Incident Analysis

I am an angel investor in Jeli.io, and could not be more excited for the product to come out of “stealth” mode, for many reasons! At its core, Jeli is built around the easy collection and annotation of this data, helping analysts make connections between chat transcripts, interviews, prior related incidents, and a whole host of

Troubled Times: Episode 3

Demonstrating adaptability and the need for sustained adaptability

Projections made in previous TT episode

The projections made in Episode 2 are largely confirmed by experience during the summer. Private communications with individuals on these topics reveal substantial levels of stress and stress-related disturbances, notably sleep cycle disturbances, vivid dreaming, varying degrees of anhedonia, strained personal

Presentation: “Findings From The Field”

A few weeks ago I gave a talk at the DevOps Enterprise Summit London (which was virtual). The description of the talk: In the past two years, we have had the opportunity to observe and explore the real nitty-gritty of how organizations handle, perceive, value, and treat the incidents that they experience. While the size,

How Learning is Different Than Fixing

I was honored to present a talk at the AllTheTalks conference a few weeks back. tl;dr: slides are here; video and transcript are below. The topic was incident analysis (big surprise there!) and the notion of learning and fixing, and how these activities are related but not the same. A key idea here is that rather than simply focusing on

Troubled Times, Episode 2

Operators and Organizations Coping with Prolonged and Stressful Emergencies

We’ve made some observations over the past month about how tech organizations are adjusting their practices as a result of the COVID-19 pandemic. It’s still too early to synthesize these observations into what we might call “patterns” but some of them do appear to be taking

Troubled Times, Episode 1

The Criticality of Sustaining the Deployment Pipeline During COVID-19

Many thanks to the members of the Learning From Incidents community who helped review and contribute to drafts of this post! The world is in crisis, but I feel like the approach that tech companies around the world are taking to immediately add significant bureaucracy to

Can Resilience Engineering be sufficiently described in 5 minutes?

I had to take up the challenge of speaking about Resilience Engineering at an Ignite talk that I gave at the always excellent NYC DevOpsDays. Here’s the abstract for this 5-minute talk: Of course the answer to the question in the title is “no” because this twenty-year-old multidisciplinary field is as broad and

The Resilience of Bone and Resilience Engineering

Last year at the REdeploy conference, my colleague Dr. Richard Cook gave one of the most cogent and clear talks on descriptions of resilience and resilience engineering that I’ve come across. I’m embedding the video below, but I’m also including an interactive transcript for those who prefer reading rather than watching and listening.

“Tricks of the Trade” on ‘how?’ versus ‘why?’

A question today on Twitter regarding the distinction between “how?” questions and “why?” questions reminded me of another reflection on this topic, in Tricks of the Trade: How to Think About Your Research While You’re Doing It. Howard Becker (University of Chicago) in 1998 wrote on page 58: Ask “How?” Not “Why?” Everyone knows this

The Requirements For Aftermath Projects

Given that this is the holiday season in the US and both “Black Friday” and “Cyber Monday” are coming up, we thought we’d give some more detail on what doing an Aftermath project entails. We sincerely hope not to hear from you!

What is an Aftermath project?

An “Aftermath” project is an immediate investigation and

Markers of Progress in Incident Analysis

Quite often, we will talk to new clients about what “learning from incidents” means to them. The descriptions we get in response tend to be important signals for us as we work through whatever project we’re engaged with. Invariably, people will mention signals that learning is taking place in the form of what we might

Some Observations On the Messy Realities of Incident Reviews

1) Incident reviews serve multiple purposes. Some of these purposes are overt and explicit but many are not. Some of these purposes cut across others in unproductive ways. Competing agendas reveal the power dynamics present in the organization. A manager who complains that too few action items were produced has revealed his/her real interest: the

The Negotiability of “Severity” Levels

What does the term severity mean, in the context of incidents involving software systems? Merriam-Webster gives us this: “the quality or state of being severe: the condition of being very bad, serious, unpleasant, or harsh.” Here are a few colloquial definitions: “Severity measures the effort and expense required by the service provider to manage and resolve an

Hindsight and Sacrifice Decisions

A few weeks ago I tweeted this thread, which references sacrifice decisions and contrasts some facets of the Knight Capital (2012) case and the NYSE trading halt (2015) case: On Aug 1, 2012, a company named Knight Capital experienced a business-destroying incident. Much has been written about it, but that’s not the topic of this thread.

Human (Cognitive) Work Happens “Above the Line”

In 2017 I gave a talk at the DevOps Enterprise Summit and described a sort of frame that my colleagues outlined in the Stella Report. I walked through it bit by bit, and I’ve captured just that 6-minute excerpt in this video here. However, not everyone likes to watch videos of conference talks, so below

Chapter in “Seeking SRE”: SRE Cognitive Work

Dr. Richard Cook and I were honored to contribute to a chapter in the new book Seeking SRE edited by the lovely David Blank-Edelman. Consider it a signpost along the way as we learn more about how engineers actually do their work in real-world scenarios (rather than abstract or generalized descriptions of how we like to

REdeploy Conference: Finding Sources of Resilience

In August I was honored to speak at the inaugural REdeploy conference centered on the topic of resilience. Here is the abstract for the talk: Sustaining the potential to adapt to unforeseen situations (resilience) is a necessary element in complex systems. One could say that all successful endeavors require this. But resilience is (in many ways)

Incidents As We Imagine Them Versus How They Actually Happen

In September I gave a keynote talk at the PagerDuty Summit in San Francisco, and I echoed some of what I covered in my talk at the REdeploy conference. For those who like oversimplifications, here is a bulleted list of “takeaways”: Real incidents (and the response to them) are messy, and do not take the nice-and-neat form

The Multiple Audiences and Purposes of Post-Incident Reviews

The conventional rationale for undertaking some form of post-incident review (regardless of what you call this process) is to “learn from failure.” Stated without further specifics or context, this is, for the most part, a banal platitude aimed at providing at least a bit of comfort that someone is doing something in the wake of these surprising

Moving Past Shallow Incident Data

Here is some data: the last incident your company experienced lasted 54 minutes. What insight does this data reveal besides a) an incident happened, and b) it lasted 54 minutes (at least according to someone interpreting an event as an incident)? What else could we glean from this data? Hrm. What if we found that someone marked down

Taking Human Performance Seriously

In November last year, I gave a talk at the DevOps Enterprise Summit in San Francisco. A core point of my talk was that it’s time we start taking human performance seriously in the field of software engineering and operations: “The increasing significance of our systems, the increasing potential for economic, political, and human damage
