Case: Mid-size E-commerce Company
This client engaged us to assess their organizational learning from incidents and to provide recommendations for improvement. They expressed concern about how insights were being shared and used to inform operational decisions across the engineering organization (roadmap changes, prioritization of existing work, etc.).
The client was also concerned that (sometimes esoteric) knowledge about system behavior remained limited to the team(s) involved in an incident and did not spread further to others.
Findings
We found a number of what we called “positive raw ingredients” that supported post-incident learning:
- There was some (brief) post-incident review training in place, open to any employee, and there had been a recent effort to train more non-engineering staff.
- The client maintained a library of events and incidents, and the cases included multiple artifacts (chat transcripts, dashboards/plots/graphs/log lines, command outputs, etc.) intended to characterize both the incident and the after-analysis for readers who were neither involved in the response nor present at the post-incident debriefing.
However, despite this active investment in and commitment to deeper post-incident analysis and knowledge capture, we uncovered some notable gaps:
- There was wide variation in expertise across the facilitator-analyst group. A small number of facilitators, well-known for their skill in producing “ah-ha” moments in debriefings and drawing critical-but-esoteric knowledge out of senior/expert engineers, were requested to lead almost all post-incident debriefings and analyses. This resulted in an over-subscribed few and an abundance of less-skilled facilitator-analysts who were not gaining experience.
- The library of incident review artifacts was largely treated as a “write-once/read-rarely” resource. With the exception of one engineering team, no groups explicitly referenced past incidents as influences on new software designs or architecture. The team that did make use of these incident documents did so very effectively: their “design doc” template, used for discussing proposals for architectural changes or the introduction of new languages, routinely referenced diagrams, graphs, and even chat transcripts as supporting evidence. As a result, newly hired engineers on this team reported being more confident in taking on on-call responsibilities because they had reviewed the “weird things” that happened in past incidents.
Recommendations
- ACL provided training and coaching to an identified group of enthusiastic staff, packaged it as a repeatable workshop, and set up an initial cadence of shadowing and follow-up evaluation.
- The “bootcamp” experience for newly hired engineers was modified to include a deliberate review of past incidents, with follow-up Q&A sessions with engineers familiar with the events, and with new hires documenting and sharing any new insights with the broader team.
- The team that made extensive use of the incident artifact library in their design and architecture review discussions planned an engineering org-wide talk about their practices, and leadership is considering making those practices standard across all teams.
Markers of progress
- Almost all engineering teams have adopted the practice of referencing post-incident review documents in other documents, such as new architecture diagrams, product roadmap justifications, and runbooks. Engineers reading these now report having a broader understanding of, and context for, decisions and plans.
- The incident library tool has seen a number of improvements that encourage dissemination: anchor links to individual artifacts (graphs, log snippets, etc.), analytics that record how each incident case and artifact is being referenced, a bookmarking feature, and a search interface.
- Many more facilitator-analysts are in the rotation for new incidents, and a veteran-newbie shadowing/feedback practice is in place.