I was honored to present a talk at the AllTheTalks conference a few weeks back.
tl;dr: slides are here, video and transcript are below
The topic was incident analysis (big surprise there!) and the notion of learning and fixing, and how these activities are related but not the same.
A key idea here is that rather than simply focusing on identifying fixes for the parts involved in the event, focusing on developing a richer understanding of the event will yield a much greater ROI on the effort, including more effective “fixes” and more.
By a “richer understanding” I mean facets and elements of the incident that include:
- the origins of the event
- the history of other relevant projects or incidents in the past
- what was difficult for people attempting to understand what was happening
- where the event came from
- the scope of what they understood
- what mattered about the event at the time
If you include these, you’ll have a better shot at capturing what is interesting for those who:
- were not there at the time,
- were there at the time but will want to revisit details that they’ve forgotten sometime in the future, and
- might not even work at the company yet.
Here is the ~30 minute talk, with the amazing Sasha Rosenbaum as my host! I’ve included an interactive transcript as well.
The first thing I'm going to assert is that current, typical, conventional approaches to learning from incidents have very little to do with actual learning. From the title, you can understand that my opinion is that learning is not the same as fixing, and we'll talk a little bit about what I mean by that. I'll also assert that most post-incident review documents are written to be filed, not written to be read, and that has some obvious and not-so-obvious implications for what's learned from them.
I would also say that changing the primary focus from fixing to learning will result in a significant competitive advantage. Now, it's going to be quite difficult to make that switch, but let's see what I mean by fixing and learning, and see what you think. We can talk about it later in Q&A. As engineers, we're somewhat hardwired; our core rationale for being is to fix things, to build things, to adjust things. I'm going to say that when you focus on fixing things, they tend to get fixed pretty quickly, and that's expected. But I would also say that the quality and effectiveness of that fixing, and of otherwise preventing future issues, is going to be proportional to, or certainly influenced by, how well you understand how the thing works to begin with.
And this is a real distinction. The point that I'll keep trying to hammer home is that if your goal is to fix, you're going to fix something. Whether it was the right thing to fix, or whether you fixed it in the best way possible, even if a small step back would have given you a broader understanding, is really up in the air. Another way of saying this is that attention paid to learning will always yield higher-quality fixing; I feel very confident about that. On the flip side, if you focus exclusively on fixing, that can be a barrier to learning, because as soon as a plausible way to fix something has been presented, time and production pressure tend to shut down any other options.

So I want to take a look at the conventional view of how post-incident analysis usually goes. There's an incident, some time passes, and maybe somebody does some work to make a timeline. Then there's the meeting, which is typically where the industry believes the work happens. By the way, this meeting is very expensive, and the burden on it to solve all of the problems so that the incident never happens again is very, very large. Maybe somebody will write a report; hopefully somebody will write something up, because there are people who weren't at the incident and weren't at the meeting. But the tendency is to think that the only, or greatest, value is in the action items. Now, if the action items are the only goal you have, then I would go out on a limb and say that you're missing out on a significant amount of what you could be getting out of the incident.
So let's talk about learning for just a minute. I would say that learning is happening all of the time; it's a core part of being human, not just of being an engineer. There's no crisp line between people who are learning from incidents and people who are not. The fact is that learning is happening. The question isn't whether learning is happening; the questions are what is learned, who is learning, when they are learning, and how they learn. All of these depend on how well practices are set up to support it. If peers, technology, leadership, and the rest of the business set up conditions that make it a priority to learn from incidents, then it's possible. If not, then it's not.
This might be reasonably obvious, but in modern, complex software applications, no one can understand everything about everything. No single person can do this. Given that, we should be surprised that things work as well as they do. When you think about it for a moment, no one has an objectively comprehensive understanding of how everything is supposed to work. People, collectively with their peers, recalibrate how they understand things work. Incidents are a big part of that, and it's still not comprehensive. The fact is that they know enough to make it work as well as it does.
Finally, I would just throw in here that the frequency of incidents has really nothing to do with how well an organization learns. It might be a signal of something, but if the number or rate of incidents is going up, that is not evidence that the organization is not learning; if it's going down, that is not evidence that the organization is learning.

A conventional myth is that when we have an incident, there is a canonical set of lessons that can be extracted from it, like a concentrate. That's the whole point, right? Those lessons are then shared or transmitted to a group: here's this incident, and here they are, these beautiful, delicious lessons, nicely concentrated, so be careful and consume them. When we have this perspective, the perceived problem to be solved is how to somehow share better. It must not be the lessons; it must be the sharing. Now, I would say sharing is a little bit funky here, because in some contexts, if I share something with you and you didn't take it, or didn't even know it was there, did I really share it, or did I just make it available to you? That's a digression. The reality is that different people will always have varying understandings. They have different perspectives before and after an incident, and what questions or mysteries remain for them can't be captured in some one-size-fits-all package sent out to the masses. If I tell you something that you already knew, did you learn it? Did you learn it from that lesson? I'm not entirely sure. What is important or notable or interesting will differ from person to person, and can still be valuable to each of those people. You can come to know something that your peers already knew, and it can still matter to you. It turns out that what's important or notable or interesting is what actually gets remembered. Now, this might be somewhat controversial, but I'm going to say it: if you can't remember something, you can't say that you've learned it. I feel very strongly about this since my kids have been home learning.
In my organization, when we talk to teams, when we talk to engineers, we always ask them (we did just yesterday): can you tell us about an incident that was significant? Tell us a story about an incident.
The first thing that comes to mind is that in the hundreds of times we've asked, there's not one engineer who has failed to come up with a story. And here's the thing: they're amazing. Not one story made me think, "Hmm, that's okay, that's fine." They were all amazing. When we ask people about incidents, a couple of things are common. One is that they become very animated when they tell the story: "Oh yeah, listen to this one," or "Let me tell you about this one." They also tend to include elements of suspense in how they tell it. Even engineers who wouldn't describe themselves as storytellers absolutely put a lot of effort, whether they know it or not, into how they tell it. That's what makes it a story. They include elements of surprise, like what they didn't know at the time: "So we went to fire up the cache, but what we didn't know was..." They'll also give some backstory: "While this was going on, remember, we were streaming the Super Bowl," or "Remember, we were two hours away from our company's IPO." They recall it in ridiculously fine detail, even if it's been many years since. And all of this fits with what we know about learning from incidents in other domains: stories that people remember have elements of challenge, struggle, and difficulty. So if you want to broaden and deepen the understanding of how things work that surrounds an incident in your organization,
you should consider doing that.
One way of saying this, setting the meeting aside for a moment: interesting incident analysis documents get read. Compelling write-ups get read, and they get shared with others. You know you're on the right track when engineers say to a colleague, "You've got to read this. I don't care what you're doing; if you're not deploying to prod, read it now, and if you are, read it when you're done. Check this out." Fascinating documents get read, shared with others, commented on, and asked about. They get referenced in code comments and in pull requests; you'll see them referenced in architecture diagrams, and other incident write-ups will refer to them. We've seen compelling write-ups, so demonstrative and so illustrative of what's happening in the software and in the organization, that they've been included in new-hire onboarding. Here's what's important: uninteresting documents don't get any of that.
Some of you may have seen or heard of a movie called Green Lantern. The point I want to make here is that just because something is made openly available everywhere doesn't mean it's any good. The reason nobody went to see Green Lantern is that it's terrible. Incident write-ups are the same. So make an effort to highlight the messy details. The messy details are what make people want to read and want to ask more questions. What was difficult for people to understand during the incident? What was surprising for people about the incident? How do people understand the origins of the incident? What mysteries still remain for people? Mysteries always remain; nobody has a comprehensive, objectively true, universally understood understanding of how an incident came to be, how it took place, and what to do about it later.
I want to know what was difficult about an incident and I want to be able to ask questions about that. This is something that all engineers could consider internalizing.
One point that I'd like to throw out, because it's quite typical, is that organizations only pay attention, or devote deliberate effort to understanding, the incidents that have high severity or high customer impact. But here's the thing: customer impact is not equivalent to the difficulty of solving the issue. They're orthogonal. A second point is that multiple difficulties can exist in the same incident, and fielding questions about what was, or still is, difficult is the gold. That is the critical part; that is how lasting memories are formed, and that is how people who were there, or were at the meeting, retain them over time.
So multiple people have multiple stories. We talk a lot about mental models, but mental models are really stories; they're not pictures. Until an incident happens, people can quite often feel pretty confident that how they think it works is how it works. What incidents do is allow us to take what we have in our minds, superimpose it, and contrast it with what others have in theirs. A great deal of it overlaps with what other people think about how it works.
This is about using this opportunity for actual learning. So what are some good goals for a post-incident meeting? When participants in this big, expensive group meeting leave knowing new things they didn't know when they entered, and new things they didn't know about what their colleagues know about the incident, things are going well. Ideally, they'll also have an understanding of how to continue the discussions and where to capture them. If you think about it for a minute: if the default is that you've got an hour, or two hours, or some common regular cadence for these meetings (postmortem, post-incident, after-action, learning review, whatever you'd like to call them), it seems strange to spend the same amount of time on every incident despite the fact that incidents can vary in so many different directions.
So continuing discussions, and where it's all captured, is significantly important. The same goes for a write-up. When you're done reading a write-up, do you know things that you didn't know when you started? Do you know things you didn't know about what your colleagues know? Do you know how to continue the discussions and where to capture them? That is what makes a post-incident review write-up something worth reading, worth sharing, and potentially valuable to people in the future, because you're not writing it for now; you're writing it for people who will want to read it and need to read it in the future. So here are some things to experiment with.
Separate the generation of follow-up action items from the group meeting. Instead of making the whole goal of that meeting a rush to collect follow-up items, separate it and make the meeting about understanding. Record in the document who responded to the incident and who attended the group meeting; this is supporting your reader. Capture in the document the things that were done after the incident but before the group meeting. Give write-ups to brand-new engineers and ask them to record any and all questions after they've read them. This is an easy one: link jargon terms in the incident review documents to documents that describe them. Ask more people to draw diagrams; visual representations can be really important and make it easier for people to read and understand.
Have someone who was not involved in the event lead the analysis. Ryan Kitchens wrote a blog post that I've linked to here. He says: if the incident analyst participated in the incident, they will inevitably have a deeper understanding of, and bias toward, the incident that will be impossible to remove in the process of analysis. If you were there, you're not necessarily going to investigate things that you're pretty sure you already know. Lastly, resist focusing on reducing the number of incidents. Ryan continues: the cliché idea that we would do this work to reduce the number of incidents or to lessen the time to remediate is too simplistic. Of course organizations want to have fewer incidents. However, stating this as an end goal actually hurts organizations. Indeed, it will lead to a reduction in incident count, not by actually reducing the number of incidents, but rather by lessening how, and how often, they are reported. I would add to what Ryan says here: instead, focus on increasing the number of people who want to read these reports and attend these post-incident review meetings. So that's the talk. I hope people have questions; I'm not entirely sure if they do, but that's the talk as it stands.
In some cases, there's nothing, at least nothing that I've seen, that gives a procedure for how to run a postmortem meeting on a minute-by-minute basis. And as much as I'd like to rail on constraining post-mortem templates to fill out, most of them, or many of them, still have open text fields. Here's the thing: stories like the ones I've described are happening.
They're just not seen as the thing, right? There's the official bookkeeping sort of accounting: you fill out the form, you do the thing, you ask "why" a certain number of times, you find the root cause, and all of that. I can say that I'm certainly not a fan of declaring organization-wide that something new is different, that you're flipping from one world to another. What we see more often is enthusiastic individuals inside the organization doing things just a little bit differently, in addition to what they're already doing. Over time, people will engage with that extra stuff more than with the stuff they had previously been doing, because what they had previously been doing is terrible. And you just have to give someone a post-incident review.
You know, I ask people all the time: do you have postmortems? And they say, "Oh yes, of course, we never want a good crisis to go to waste," and insert some cliché there. And I say, "Oh, that's excellent. Who reads them?" Nobody has an answer for that. What we hear is that nobody reads them, because there's nothing there; you can't get much out of them, and it's easier to go to your colleague and ask what happened.
What was important, what was difficult, what was notable about a given incident. Over time, they started looking more at that extra stuff, asking more questions about that and fewer about, you know, why the number isn't going down. In that way, I'd say it's a little bit of subterfuge, a little sneaky, but it's effective.
paraphrased and attributed in an incident review. But the content generators are there, and you know they're there because these are the stories you hear after work, the stories you hear during lunch, the stories you hear over drinks, the stories you hear in the hallways of conferences.
And yes, when we find engineers telling stories about interesting outages, every engineer in the room goes quiet and wants to listen very carefully. So I believe the power is there.
Maybe you don't have some way of gathering data from people: qualitative reflections on what was hard, what was surprising, that sort of thing. If there's not a meeting, and you can't convince somebody to have a meeting, you might be able to convince somebody to talk to you for five or ten minutes at a time. What we have seen is somebody doing that and then synthesizing what they've heard into a document, which people can use later as well: you know, we could have a meeting, and we can experiment with it. Worst case, it's a terrible meeting, but I'm sure we already have terrible meetings that we can't get rid of. So how about adding one more and trying to make it a good meeting?
That's the best answer.