Quite often, we will talk to new clients about what “learning from incidents” means to them. The descriptions we get in response tend to be important signals for us as we work through whatever project we’re engaged with.
Invariably, people will mention signals that learning is taking place in the form of what we might call shallow data, such as:
- A decrease in the frequency of incidents they experience
- A decrease in the impact incidents have on the business’ goals
- A decrease in the length of time incidents last (or “time to resolve” or “time to detect” or “time to X”)
- An increase in the number of completed post-incident “action” items, or a decrease in the length of time it takes to complete them
What we can say definitively is that learning is happening, regardless of what happens in post-incident activities. People are always learning. The challenge is not getting people to learn; it is making it easy for people to learn what is likely to be useful in the near term and in the more distant future.
Detecting that effective “learning” from an incident has taken place is quite difficult, and progress in learning from incidents is hard to capture and characterize. However, there are a number of potential indicators that, taken together, could provide evidence of that progress.
These markers or indicators include:
- More people will decide to attend post-incident review meetings, and attendance will grow over time. Engineers will report that they learn things about their systems there (and in the incident analysis write-ups that result) that they can’t learn anywhere else.
- Post-incident review meeting attendance will include people from engineering and customer support who were not directly involved in the incident under discussion.
- Engineers will actively seek focused incident analysis training. They will express interest in topics related to accident investigation and read more on these topics on their own time.
- Tools that aid incident analysis and post-incident review meeting preparation, or that enrich the post-incident artifacts, will appear and be refined.
- The number of “orphan” post-incident “action items” (in JIRA or other task-tracking systems) will trend downward. Orphan items will be “adopted” by being reviewed and cross-referenced to incidents and post-incident analysis write-ups.
- Post-incident analysis document content will become richer (e.g., include diagrams drawn by participants in post-incident review meetings, the actual transcripts of the incident response and handling, contributions from customer support staff).
- The number of unique readers of post-incident analysis write-ups will grow over time. Even months after the analysis is published there will be new views of the document(s). Comments, replies, highlights, tags, and other metadata regarding the content will come from an ever broader audience and spark new dialogue between readers.
- Incident analysis documents will be used in new-hire onboarding or training as vehicles to describe in rich detail the histories of involved technologies, the challenges and risks faced by teams, and the configuration of systems and dependencies.
- Incident analysis document content will be written and organized to make the incident features (sources, conditions, difficulties in handling, etc.) explicit enough that future readers will be able to easily find and understand them. Regular evaluations of past incident analysis documents will confirm this.
- Engineering teams will use incident analysis documents as primary training materials.
- Explicit references to specific incident analysis documents will appear more frequently in company internal documents. Citations of specific incidents in project/product “roadmap” documents, “runbooks”, hiring plans, new systems design proposals, etc., are evidence that the authors understand both the value and the relevance of experience with incidents.
- Incident analysis documents originating in engineering groups will routinely be reviewed by those in other groups (such as customer support). Comments from these groups will be included and cross-referenced in the post-incident documents.
- Post-incident documents originating in other groups (such as customer support) will routinely be reviewed by engineering groups.
We understand that the phenomena above can sound unbelievable to some, but we have seen them. At one time, the idea of deploying to production multiple times per day also sounded unbelievable. Until it wasn’t.
Want to believe? We can help you.