We have written before that documents written about an incident can take many forms and structures, depending on the author(s), purpose, and target audience.
The goal of this post is to describe what makes the public-facing articles that companies publish about incidents different from the internal write-ups that represent an effective incident analysis, and to explain why this difference matters.
While many companies share details about incidents they’ve experienced, some provide more compelling stories than others and, as a result, can generate a lot of attention from software engineers working elsewhere. An example would be this summary that Cloudflare wrote about an incident they experienced in July of 2019. We’ll discuss what makes an article like that one more captivating (and discussion-provoking) than others in another post.
Characteristics of internal incident analysis write-ups
Internal write-ups tend to be written with boots-on-the-ground practitioners in mind, although in some cases, they can also be written for leaders in the company as the primary audience (see below).
The purpose of an (effective) internal write-up is to represent the richest understanding of the event for the broadest possible audience of internal, hands-on staff so that they may benefit from the experience.
These benefits would include things such as:
- An informative description of how the various mechanisms, composition, and dynamics of the technical systems involved in the event work — what they’re expected to do, what their relevant history entails, etc.
- A better understanding of what operational vulnerabilities, fragilities, or surprises can show up in these below-the-line bits.
- A richer picture about what to look out for in designing or operating similar systems in the future.
- A better understanding of what was difficult for people as they wrestled with what was happening in the event — and what made those difficulties…difficult.
- A greater appreciation for what pitfalls or minefields could lead them to a mistaken understanding, what options they might generate for taking actions to remedy the situation, what options are not available to them, what steps may make the situation worse, etc.
- When the audience is primarily leaders and management (not hands-on practitioners), the write-up’s sheer existence often demonstrates that boots-on-the-ground staff are exercising due diligence.
Even when the write-ups are aimed at practitioners as the audience, the primary purpose can be a demonstration that those responsible for the technical areas (sometimes called “owners”) are doing something about the incident, almost always in the form of follow-up “action items.” In these cases, the narrative takes a back seat and rarely contains detailed descriptions of the event from multiple perspectives.
Because the audience for internal write-ups is made up of “insiders” who (potentially) have more historical context (and access to background materials), the actual content of the narrative tends to include things such as:
- shorthand names or jargon familiar to people who work at the company
- links to other internal systems (wikis, diagrams, ‘runbooks’, dashboards and other related tooling, bug tracking, etc.)
- graphic representations of the systems involved
- the names of the authors
- the data they collected for the analysis, and the methods they used
- the number of people or names of teams who responded to the event, when they joined or left the response
- what they actually did during the incident, as they:
  - tried to make sense of what was happening
  - generated and discussed potential hypotheses
  - came up with options on what actions to take and what ramifications those options might have
  - discussed trade-offs or dilemmas they might be facing
These internal write-ups are almost always written by practitioners, not by people in departments such as PR, Legal, Marketing, or Finance.
Characteristics of public-facing articles about incidents
In some cases, companies will confidentially send descriptions of incidents to paying customers as part of their contractual obligations. While these (sometimes called “RCA”s) tend to be similar to those posted for the general public, I’m focusing here on articles or blog posts available for anyone to read.
Almost exclusively, these pieces are written for readers who do not work at the company, including:
- Customers of the service (both current and potentially new)
- Journalists and others looking to report on the event in the media
- Technology analysts (Gartner, Forrester, etc.)
While they are not often acknowledged as an audience, engineers working at other companies do read public posts about incidents, especially those with technical details. In this way, these descriptions can serve — albeit in a minor way — as a passive recruiting effort by ‘pulling back the curtain’ a bit on what work looks like at the company writing it.
The purpose of public posts about incidents is, ostensibly, to provide readers with enough detail about the event (but no more than that) to satisfy them that the company understands the incident. The idea is to restore the (potentially shaken) confidence of the audience.
This effort is a means to an end: customers continue paying, new customers are undeterred, and tech analysts will take the ‘transparency’ into account when they write their assessments of the business’ services.
As a result, the sentiment typically expressed in public posts is in the form of:
“We know what has happened, and we’re fixing it. We understand this was disruptive for you as customers, and we apologize for that.”
This characterization is obviously thin; the point is to underscore that the primary driver for posting publicly about the event is to signal to outsiders that the inside of the organization acknowledges that the event took place and that they’re paying attention to it. This is a completely reasonable action to take from a business perspective. In many cases, messaging like this is also driven by legal/contractual obligations.
In any case, restoring confidence is the core rationale of public posts, above all others.
Public posts are very often cherry-picked from the stream of incidents the company experiences, and typically are written about severe and/or high-profile events. The content of these posts commonly includes a few characteristics:
- They almost always include some form of apology. This is often accompanied by a commitment to future improvement in order for the event to “never happen again” and high-level descriptions of what actions the company intends to take to make good on this commitment.
- They are often written (or at the very least edited and/or approved) by staff who are not hands-on practitioners, often by folks working in PR, Legal, Marketing, or Finance.
- They almost never contain shorthand names or jargon known only to internal staff, unless it’s critically necessary for the narrative.
- They (obviously) do not link to relevant internal systems for readers to learn more about details (wikis, diagrams, ‘runbooks’, dashboards and other related tooling, bug tracking, etc.).
Something notable about public posts about incidents is what they do not contain. They very rarely link to other public posts about incidents published in the past. To make a connection from one incident to another (for example, a “repeat” incident) would undermine a core purpose of public posts: to assure the audience that the incident is sufficiently understood and future prevention is “ensured.”
Again: so what?
To be clear: it’s not only reasonable that public-facing write-ups differ from what internal analyses look like, it would be surprising if they didn’t! Code comments, commit messages, and wiki docs all focusing on the same bit of tech aren’t mirrors of each other either, because they take into account the context a potential reader might have.
It can be tempting to read about another company’s incident and assume that the story presented represents the story insiders also have about the event. There will certainly be commonalities between the two versions, but the quality of detail needed to restore confidence has to meet a much lower bar than the quality of detail aimed at enabling an engineer to better understand the world they’re working in currently and in the future.
Here are some questions that come to mind for me when I read public articles about incidents…
- What details (as an engineer) seem to be glossed over in the description? What would I want to know more about? If I were to ask the author for this detail, can I imagine they’d offer it?
- What elements of the piece stick out to me that are not often found in public posts, such as diagrams, screenshots, snippets of log lines or other things that people shared with each other as they tried to understand what was happening? I’m always curious about what brings some authors to include these bits in some cases and not in others.
- Do I read any couched language intended to paint a rosier picture than what customers experienced?
- Can I see relevant connections (even if the author doesn’t explicitly make a connection) between this particular event and previous ones at the same company?
- Is there anything in the description about what was difficult, ambiguous, scary, or confusing for them in handling the case – or is the article mostly about the technical mechanism of the event?
Perhaps the greatest value of public-facing posts about incidents (at least for me) isn’t the great effort companies make to restore my confidence in the service, but that it’s an opportunity to generate new and better questions and community discussion about incidents in general?
UPDATE: The Verica Open Incident Database (VOID) has now launched! Their initial report is here. What sort of questions come to you when reading these public posts?