Last year at the Re-Deploy conference, my colleague Dr. Richard Cook gave one of the most cogent and clear talks on resilience and resilience engineering that I've come across. I'm embedding the video below, but I'm also including an interactive transcript for those who prefer reading to watching and listening.
A question today on Twitter regarding the distinction between "How?" questions and "Why?" questions reminded me of another reflection on this topic, in Tricks of the Trade: How to Think About Your Research While You're Doing It. In 1998, Howard Becker (University of Chicago) wrote on page 58: Ask "How?" Not "Why?" Everyone knows this
Given that this is the holiday season in the US and both “Black Friday” and “Cyber Monday” are coming up, we thought we’d give some more detail on what doing an Aftermath project entails. We sincerely hope not to hear from you! What is an Aftermath project? An “Aftermath” project is an immediate investigation and
Quite often, we will talk to new clients about what “learning from incidents” means to them. The descriptions we get in response tend to be important signals for us as we work through whatever project we’re engaged with. Invariably, people will mention signals that learning is taking place in the form of what we might
1) Incident reviews serve multiple purposes. Some of these purposes are overt and explicit but many are not. Some of these purposes cut across others in unproductive ways. Competing agendas reveal the power dynamics present in the organization. A manager who complains that too few action items were produced has revealed his/her real interest: the
What does the term severity mean, in the context of incidents involving software systems? Merriam-Webster gives us this: “the quality or state of being severe: the condition of being very bad, serious, unpleasant, or harsh.” Here are a few colloquial definitions: “Severity measures the effort and expense required by the service provider to manage and resolve an
A few weeks ago I tweeted this thread, which references sacrifice decisions and contrasts some facets of the Knight Capital (2012) case and the NYSE trading halt (2015) case: On Aug 1, 2012, a company named Knight Capital experienced a business-destroying incident. Much has been written about it, but that's not the topic of this thread.
In 2017 I gave a talk at the DevOps Enterprise Summit and described a sort of frame that my colleagues outlined in the Stella Report. I walked through it bit by bit, and I've captured just that 6-minute excerpt in this video here. However, not everyone likes to watch videos of conference talks, so below
Dr. Richard Cook and I were honored to contribute to a chapter in the new book Seeking SRE edited by the lovely David Blank-Edelman. Consider it a signpost along the way as we learn more about how engineers actually do their work in real-world scenarios (rather than abstract or generalized descriptions of how we like to
In August I was honored to speak at the inaugural REdeploy conference centered on the topic of resilience. Here is the abstract for the talk: Abstract Sustaining the potential to adapt to unforeseen situations (resilience) is a necessary element in complex systems. One could say that all successful endeavors require this. But resilience is (in many ways)
In September I gave a keynote talk at the PagerDuty Summit in San Francisco, and I echoed some of what I covered in my talk at the ReDeploy conference. For those who like oversimplifications, here is a bulleted list of “takeaways”: Real incidents (and the response to them) are messy, and do not take the nice-and-neat form
The conventional rationale for undertaking some form of post-incident review (regardless of what you call this process) is to "learn from failure." Offered without more specifics or context, this is, for the most part, a banal platitude aimed at providing at least a bit of comfort that someone is doing something in the wake of these surprising
Here is some data: the last incident your company experienced lasted 54 minutes. What insight does this data reveal besides a) an incident happened, and b) it lasted 54 minutes (at least according to someone interpreting an event as an incident)? What else could we glean from this data? Hrm. What if we found that someone marked down
In November last year, I gave a talk at the DevOps Enterprise Summit in San Francisco. A core point of my talk was that it’s time we start taking human performance seriously in the field of software engineering and operations: “The increasing significance of our systems, the increasing potential for economic, political, and human damage