The conventional rationale for undertaking some form of post-incident review (regardless of what you call this process) is to “learn from failure.”
Without more specifics and context, this is, for the most part, a banal platitude aimed at providing at least some comfort that someone is doing something in the wake of these surprising and sometimes uncomfortable events.
In reality, learning is quite difficult to avoid. In other words, the answer to “Are you learning from incidents?” will always be “Yes.”
Some better questions might be:
- What (specifically) are people learning?
- Who is developing new understandings based on what they’re learning? How might we know whether they’re learning the same things as others?
- When are they learning these things?
- Where does this learning take place?
- How far does this new understanding travel?
(I’d like to dedicate an entire separate post to elaborate and explore those questions; we have experience with them in our assessment projects.)
When we look closely at post-incident artifacts, we find that they can serve a number of different purposes for different audiences. I recently spoke at the PagerDuty Summit and touched on this topic.
When the audience is “management”
In some cases, writing up a post-incident review is a “due diligence” demonstration for management levels above you in the organization, as a way to show that the team is “doing something” in reaction to the event.
Some organizations will produce a specific and brief description of an incident for management consumption. These are typically not designed to carry many technical or narrative details about the event, other than to telegraph:
- An incident happened at this time
- It is significant for these (distilled) reasons
- We are doing these things in response
When the audience is potential new customers
Post-incident review documents that are made publicly available (in a blog post, email, etc.) serve a similar purpose as write-ups that are written with management as the audience: to acknowledge that something happened and that something is being done about it. The primary goal of this artifact is to build trust with those who are reading the document, whether they are a current paying customer or not.
This trust is typically built by describing in some detail “what happened.” This level of detail (what to include, what not to include, etc.) is critically important, and authors of these documents give it much more thought than is generally acknowledged. Put in too much detail (technical or otherwise), and readers will scroll past your wall of text. Put in too little detail, and readers may see that as hand-waving and a signal that the incident isn’t very well understood.
One way of testing a post-incident review narrative:
- Does it reference what actions specific (named) people took during the response?
- Does it reference what hypotheses were generated during the handling of the incident (and who specifically offered them) – especially if those hypotheses turned out to be unproductive?
If the answers to those are “no,” then you’re likely not reading a post-incident review document intended to generate or disseminate much insight about an incident.
Therefore, public “postmortem” documents are not aimed at capturing and sharing actual insights about the incident for insiders of the company to consume – because they’re not designed to. They’re designed to build trust with external readers, which is entirely reasonable for a business to do. They’re just not concerned with aiding the generation and dissemination of insights internally.
When the audience is existing customers with contracts
Sometimes, organizations are contractually obligated to produce a report of some sort that describes an incident, along with various categorizations or qualities of the incident that may trigger a refund or future credit for the customer. These descriptions are legally binding because they’re dictated as part of a contract.
Very few organizations we’ve come in contact with have ever referred to these sorts of descriptions as actual post-incident review artifacts; they’re usually named very specifically, as an “Incident Report” or “Root Cause Analysis” or something along those lines.
Suffice it to say that these documents (again, by design) are not concerned with genuine analysis of an incident or the generation of insight.
Multiple audiences, multiple purposes
I’m sure readers can come up with many other audiences and purposes for post-incident review documents. The point of this post is simply to acknowledge that these multiple audiences and purposes exist.
Acknowledging this is a step in the direction of evolving and shifting our perspectives on what “safety” means in software engineering and operations.