Case: Subdivision of SaaS Company
This client asked for an analysis of recent incidents (which had attracted media attention) for insight into how they might improve their incident response. They were especially concerned with improving their capacity to respond to incidents that involve many disparate groups across the company. Such incidents had been occurring because of surprising interdependencies and interactions between services.
We performed a meta-analysis of recent post-incident review artifacts and had an opportunity to observe two “medium severity” incidents during the engagement. Our recommendations included:
- Explicitly connecting post-incident review documents to those of past incidents that had similar cross-team interactions in the response.
- Inviting members of the customer support organization to participate in the group post-incident debriefings and collaborate on the write-ups of the events.
- Separating the generation and evaluation of follow-up “action items” from the post-incident group meeting.
- Giving the documents generated by the post-incident review process to engineering new hires in all teams as “on-boarding” reading exercises.
Markers of progress
- Post-incident reviews and analyses now have an explicit phase for examining incidents to which a large number of teams responded. Groups representing the core interdependent services now hold a monthly meta-review of recent incidents to fuel ongoing “operability” roadmap items.
- The extension of the “dependency” tooling project created opportunities for implicit and tacit knowledge (in multiple teams) about the various services to be made explicit via pair programming and “swarming.”
- Engineers responding to incidents involving service interdependencies now report having greater confidence in generating fruitful hypotheses and coordinating across teams.
- Architecture reviews for new service designs and legacy migrations now have specific phases focused on uncovering hidden interdependencies. The client has mentioned that staff engineers have put a copy of the “Dark Debt” section from the Stella Report on their wiki.