Adaptive Capacity Labs

Some Observations On the Messy Realities of Incident Reviews

1) Incident reviews serve multiple purposes. Some of these purposes are overt and explicit but many are not. Some of these purposes cut across others in unproductive ways. Competing agendas reveal the power dynamics present in the organization. A manager who complains that too few action items were produced has revealed their real interest: the

Read More »

The Negotiability of “Severity” Levels

What does the term severity mean, in the context of incidents involving software systems? Merriam-Webster gives us this: “the quality or state of being severe: the condition of being very bad, serious, unpleasant, or harsh.” Here are a few colloquial definitions: “Severity measures the effort and expense required by the service provider to manage and resolve an

Read More »

Hindsight and Sacrifice Decisions

A few weeks ago I tweeted this thread, which references sacrifice decisions and contrasts some facets of the Knight Capital (2012) case and the NYSE trading halt (2015) case: On Aug 1, 2012, a company named Knight Capital experienced a business-destroying incident. Much has been written about it, but that’s not the topic of this thread.

Read More »

Human (Cognitive) Work Happens “Above the Line”

In 2017 I gave a talk at the DevOps Enterprise Summit and described a sort of frame that my colleagues outlined in the Stella Report. I walked through it bit by bit, and I’ve captured just that 6-minute excerpt in this video here. However, not everyone likes to watch videos of conference talks, so below

Read More »

Chapter in “Seeking SRE”: SRE Cognitive Work

Dr. Richard Cook and I were honored to contribute to a chapter in the new book Seeking SRE edited by the lovely David Blank-Edelman. Consider it a signpost along the way as we learn more about how engineers actually do their work in real-world scenarios (rather than abstract or generalized descriptions of how we like to

Read More »

REdeploy Conference: Finding Sources of Resilience

In August I was honored to speak at the inaugural REdeploy conference centered on the topic of resilience. Here is the abstract for the talk: Sustaining the potential to adapt to unforeseen situations (resilience) is a necessary element in complex systems. One could say that all successful endeavors require this. But resilience is (in many ways)

Read More »

Incidents As We Imagine Them Versus How They Actually Happen

In September I gave a keynote talk at the PagerDuty Summit in San Francisco, and I echoed some of what I covered in my talk at the REdeploy conference. For those who like oversimplifications, here is a bulleted list of “takeaways”: Real incidents (and the response to them) are messy, and do not take the nice-and-neat form

Read More »

The Multiple Audiences and Purposes of Post-Incident Reviews

The conventional rationale for undertaking some form of post-incident review (regardless of what you call this process) is to “learn from failure.” Offered without further specifics or context, this is, for the most part, a banal platitude aimed at providing at least a bit of comfort that someone is doing something in the wake of these surprising

Read More »

Moving Past Shallow Incident Data

Here is some data: the last incident your company experienced lasted 54 minutes. What insight does this data reveal besides a) an incident happened, and b) it lasted 54 minutes (at least according to someone interpreting an event as an incident)? What else could we glean from this data? Hrm. What if we found that someone marked down

Read More »

Taking Human Performance Seriously

In November last year, I gave a talk at the DevOps Enterprise Summit in San Francisco. A core point of my talk was that it’s time we start taking human performance seriously in the field of software engineering and operations: “The increasing significance of our systems, the increasing potential for economic, political, and human damage

Read More »