In September I gave a keynote talk at the PagerDuty Summit in San Francisco, and I echoed some of what I covered in my talk at the ReDeploy conference.
For those who like oversimplifications, here is a bulleted list of “takeaways”:
- Real incidents (and the response to them) are messy, and do not take the nice-and-neat form of: detect⟶diagnose⟶repair (or derivative versions of it)
- The experiences that engineers have during incidents are palpably familiar, critical to the evolution of the incident, and also poorly captured in typical post-incident documents.
- I believe post-incident review (or “post-mortem”) templates are at best inadequate (and capture mostly shallow data) and at worst crutches for an organization that provide only an illusion of valuable analysis and learning.
The central premise of the talk is that there exists a gap between how we tend to think incidents happen and how they actually happen.
I began the talk by asking the audience a couple of questions. I asked people to raise their hand if, in the course of responding to an incident:
“…they have ever experienced such a profound sense of confusion about what signals and symptoms they were seeing that they were at a loss to come up with any plausible explanation?”
Many hands went up, along with some nervous chuckles and nodding heads.
“…when they were about to take some action, intended to either “fix” the issue or prevent it from getting worse, they had experienced a sort of “fight-or-flight” feeling of uncertainty about whether this action would indeed improve the situation or possibly make things worse?”
Many hands were lifted again, along with more laughs and nodding heads.
I said that how we imagine incidents happen can have a significant influence on what and how we learn from them. How we imagine they generically happen influences:
- what we think is important about the incident…and therefore what we dismiss
- what we discuss about the incident…and therefore what we don’t
- what we write about the incident…and therefore what we leave out
In other words: what is deemed important gets discussed, and only what gets discussed has a chance of getting written down. Because there can’t be a truly comprehensive account of an incident, anything that is not captured in a narrative form that can evoke new and better questions from readers about the event will be lost.
I also reminded the audience that post-incident reviews can serve multiple purposes and have multiple audiences.
Here is the video for the talk: