The Negotiability of “Severity” Levels

What does the term severity mean, in the context of incidents involving software systems? Merriam-Webster gives us this: “the quality or state of being severe: the condition of being very bad, serious, unpleasant, or harsh.” Here are a few colloquial definitions: “Severity measures the effort and expense required by the service provider to manage and resolve an […]

Hindsight and Sacrifice Decisions

A few weeks ago I tweeted this thread, which references sacrifice decisions and contrasts some facets of the Knight Capital (2012) case and the NYSE trading halt (2015) case: On Aug 1, 2012, a company named Knight Capital experienced a business-destroying incident. Much has been written about it, but that’s not the topic of this thread. […]

Chapter in “Seeking SRE”: SRE Cognitive Work

Dr. Richard Cook and I were honored to contribute to a chapter in the new book Seeking SRE edited by the lovely David Blank-Edelman. Consider it a signpost along the way as we learn more about how engineers actually do their work in real-world scenarios (rather than abstract or generalized descriptions of how we like to […]

Moving Past Shallow Incident Data

Here is some data: the last incident your company experienced lasted 54 minutes. What insight does this data reveal besides a) an incident happened, and b) it lasted 54 minutes (at least according to someone interpreting an event as an incident)? What else could we glean from this data? Hrm. What if we found that someone marked down […]

Taking Human Performance Seriously

In November last year, I gave a talk at the DevOps Enterprise Summit in San Francisco. A core point of my talk was that it’s time we start taking human performance seriously in the field of software engineering and operations: “The increasing significance of our systems, the increasing potential for economic, political, and human damage […]