Adaptive Capacity Labs

Case: Subdivision of SaaS Company

This client asked for an analysis of recent incidents (which had attracted media attention) for insight into how they might improve their incident response. They were especially concerned with improving their capacity to respond to incidents that involve many disparate groups across the company. Such incidents had been occurring because of surprising interdependencies and interactions between services.

We performed a meta-analysis of recent post-incident review artifacts and had an opportunity to observe two “medium severity” incidents during the engagement.

Recommendations

We made a number of recommendations. Many were straightforward adjustments to practices already in place; others introduced new practices. They included:

  • Explicitly connecting post-incident review documents to those of past incidents that had similar cross-team interactions in the response.
  • Inviting members of the customer support organization to participate in the group post-incident debriefings and collaborate on the write-ups of the events.
  • Separating the generation and evaluation of follow-up “action items” from the post-incident group meeting.
  • Giving the documents generated by the post-incident review process to engineering new hires in all teams as “on-boarding” reading exercises.

We also recommended that they officially sanction time and attention for developing and maintaining incident analysis expertise in the company, and we provided a schedule for ACL staff to offer shadowing and coaching in incident analysis.

The client already had a distributed tracing tool in place, and we recommended extending it to cover a higher-order scope of internal API interactions. This required representatives of multiple internal services to collaborate on developing a “dependency discovery” tool.
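As a rough illustration of what dependency discovery from trace data can look like, the sketch below derives a service-level call graph from trace spans and then walks it to find indirect dependencies. It is not the client's tool: the Span fields, the function names, and the assumption that each span records its parent span and owning service are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    # Hypothetical, simplified span record; real tracing tools carry more fields.
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str


def direct_dependencies(spans):
    """Map each calling service to the set of services it calls directly."""
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    deps = defaultdict(set)
    for span in spans:
        if span.parent_id is None:
            continue  # root span: nothing called it
        parent = by_id.get((span.trace_id, span.parent_id))
        # Only record edges that cross a service boundary.
        if parent is not None and parent.service != span.service:
            deps[parent.service].add(span.service)
    return dict(deps)


def transitive_dependencies(deps, service):
    """Return every service reachable from `service`, directly or indirectly."""
    seen = set()
    stack = list(deps.get(service, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(deps.get(current, ()))
    return seen
```

Comparing the direct and transitive sets for a service is one simple way to surface interdependencies that the owning team may not know about.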

Markers of progress

  • Post-incident reviews and analysis now have an explicit phase for analyzing incidents that have a large number of teams responding, and groups representing the core interdependent services now hold a monthly meta-review of recent incidents to fuel ongoing “operability” roadmap items.
  • The extension of the “dependency” tooling project created opportunities for implicit and tacit knowledge (in multiple teams) about the various services to be made explicit via pair programming and “swarming.”
  • Engineers responding to incidents involving service interdependencies now report having greater confidence in generating fruitful hypotheses and coordinating across teams.
  • Architecture reviews for new service designs and legacy migrations now have specific phases focused on uncovering hidden interdependencies. The client has mentioned that staff engineers have put a copy of the “Dark Debt” section from the Stella Report on their wiki.