For those who like oversimplifications, here is a bulleted list of “takeaways”:
- Real incidents (and the response to them) are messy, and do not take the nice-and-neat form of: detect⟶diagnose⟶repair (or derivative versions of it)
- The experiences that engineers have during incidents are palpably familiar, critical to the evolution of the incident, and also poorly captured in typical post-incident documents.
- I believe post-incident review (or “post-mortem”) templates are at best inadequate (capturing mostly shallow data) and at worst crutches that give an organization only an illusion of valuable analysis and learning.
The central premise of the talk is that there exists a gap between how we tend to think incidents happen and how they actually happen in reality.
I began the talk by asking the audience a couple of questions. I asked people to raise their hand if, in the course of responding to an incident:
“…they have ever experienced such a profound sense of confusion about what signals and symptoms they were seeing that they were at a loss to come up with any plausible explanation?”
Many hands went up, along with some nervous chuckles and nodding heads.
“…when they were about to take some action, intended to either “fix” the issue or prevent it from getting worse, had experienced a sort of “fight-or-flight” feeling of uncertainty about whether this action would indeed improve the situation or possibly make things worse?”
Many hands were lifted again, along with more laughs and nodding heads.
I said that how we imagine incidents happen can have a significant influence on what and how we learn from them. How we imagine they generically happen influences:
- what we think is important about the incident…and therefore what we dismiss
- what we discuss about the incident…and therefore what we don’t
- what we write about the incident…and therefore what we leave out
In other words: what is deemed important gets discussed, and only what gets discussed has a chance of getting written down. If it’s not captured in a narrative form for readers in ways that can evoke new and better questions about the event (because there can’t be a truly comprehensive account of the incident) – then it will be lost.
I also reminded the audience that post-incident reviews can serve multiple purposes and have multiple audiences.
Here is the video for the talk:
picture of the future. I'm just gonna throw out the idea that even in that beautiful future,
there will still be on-call rotations. So it looks like my presenter notes are a little bit weird. Ah, there we go. That'll work. So, talks in the morning of conferences, you either have to be inspirational or thought provoking. This is likely to be thought provoking, at least. That's my intent. So before we get started, I want to get sort of a shape of the room. Raise your hand if you're on call right now. Okay. So if you just, you know, bolt up and have to get out of the room, don't worry, I'm not going to get unglued about it. I've been there. Good luck.
Two more questions. Raise your hand if you've ever responded to an incident and experienced such a
profound sense of confusion about what you were seeing, about what signals and symptoms you were getting, that you were at a loss to come up with any plausible explanation.
Before you hit enter, before you take this action, have you ever felt pretty uncertain, or at least certain that there
was an equal chance that you are either going to fix it, or it will do nothing, or it will actually make things worse? Raise your hand if, yeah. So me too, I've been there, and I'll refer back to these a little bit later. So the backdrop for this talk is the work that I and some of my colleagues have been doing in studying cognitive work, and specifically cognitive work in the wild. You all in the SRE and DevOps worlds, some of you may be familiar with the STELLA report. If you haven't seen it, please go check it out.
It's the result of a year-long project. This is the first cycle in a consortium that we call the SNAFUcatchers. And here are some of the folks that were part of that as industry partners. The second cycle is underway right now. This morning I want to talk about our understanding of incidents in our industry and a different way of looking at them, in ways that we're maybe not really used to.
So in the course of our work, we see that there's a big difference between how the industry imagines incidents happening and how they actually happen in reality. And here's the nutshell: in reality, incidents are so much messier.
They don't unfold the same way that we describe them in our blog posts or the
way we talk about them in our postmortems or the way we even give guidance to each other. And so you might ask, well, so what, right? How we imagine incidents influences a whole bunch of different things. It influences what we think is important about the incident, right? If it doesn't fit what we think is important in our general sort of abstract
view of the incident, then we might not pay much attention to it. It influences what we discuss about the incident, and therefore what we don't discuss. It also influences, perhaps more importantly, what we write about the incident and
what we leave out. And the key connecting logic here is that what's deemed important in an incident gets discussed. What gets discussed gets written down. And what gets written down gets shared. We like to say we learn from failure. It's an easy throwaway line. No one will argue with that. Also, how does learning work?
That's a really hard question. I'm pretty sure we just heard Ray Kurzweil say he's spent his entire career trying to understand that. So before we get started on this gap, I want to put out something that's perhaps obvious, but only when it's strung together, and that is that there are multiple purposes and multiple audiences for incident analysis. Certainly for what we do with it, right? Maybe we do postmortems because they're effectively due diligence demonstrations, right? For management, because we need to show them that we're doing something.
Maybe it's to show good faith, to build trust or transparency in, like, a public blog post, right? With new customers. This is also very reasonable. I will say that these two different routes are likely to be
very different, the products that you put together for them. Maybe you've got an agreement or a contract in some legalese with some SLAs, and we have to do some stuff, some obligations that way. Or we might have some auditors or regulators.
Or maybe we're doing incident analysis so that people in your org,
maybe not even just engineers, but people in your company, have a better understanding, maybe a richer, deeper understanding of the incident, and what its sources of creation are, and its handling, and all of that sort of thing. I think we have a feeling that this last one, number five, is what we generally think of, at least the folks in this room, but we have to acknowledge that these other purposes and these other audiences exist. I will tell you this: not all of these are aimed at generating new and better understandings of your systems, right? Some are not concerned at all about learning.
So with that caveat, I want to start with just sort of a generalized and abstract model. I'm going to put up a bunch of boxes and lines and arrows, and you don't really need the content of the boxes and lines to understand what's going on here. These are diagrams, and this is a more elaborate one, the sort of diagrams you might see of what debugging looks like, or troubleshooting or diagnosis, right? It's not really necessary to pay attention to what's in the boxes. The thing that I want to call your attention to is: look how neat and perfect those lines are. Right? Matter of fact, look at that.
There is absolutely no ambiguity that there's a difference between enumerate possible causes and use process of elimination. I mean, you know when you're in one of those boxes and you know when you've gone to the next box, right? So all of these models should look familiar. They're all sort of variations of the same, and we are always repeating them in descriptions and explanations of how incidents happen. It's a way of categorization. It's what we do. We categorize. Matter of fact, there's a handful of these that I've written, these sorts of things.
These are ones, they're all variations along the same lines, and it's really easy to assume that these descriptions, these sort of abstract things, actually represent the incidents. We know that that's absolutely not the case. That's what this talk is about. The danger is if we go in armed with these boxes and arrows. All right, we're just going to go into the postmortem, let's say: now, okay, where on the timeline did we stop doing detection? Exactly, right? Here's the most important part about these models. I want to ask you this question. By the way, all of these questions I'm asking are rhetorical.
They're supposed to, you know, get you thinking: where are the people?
It's really common to think of incidents as unfolding this way, but we have to make sure that this is a way for us to
sort things. But when we look at incidents as researchers, they never look this neat and orderly, right? But one thing that's really great about them being neat and orderly is that it makes it very convenient to tabulate them against each other and compare them and ignore the messiness. So I want to take a look at an actual incident. And so this is a not even an inch deep, superficial, very surface-oriented scan across a real incident. And this particular team, they noticed that something was amiss. Something's a little bit weird.
They're not entirely sure if it's a big deal. And at this time of year, a couple of the engineers said, you know, actually maybe these sort of elevated levels aren't really that big of a deal. Kind of expect that. So they wait some time and, you know what, let's just keep going. They go to lunch. And it was like, let's keep an eye on it. Like, if it doesn't start coming down, or if it doesn't start to sort of resolve, then, you know, we'll start paying attention. Sure enough, actually, this is something we've got to pay attention to.
So a number of them worked out what to do, then they spent some time to sort of fix it, and then they spent some more time to confirm it's actually fixed. So you can imagine a number of different variations on this. Where's detection? You could
make an argument that it's over here. Could also make an argument that it's somewhere over here. Maybe it's here. Wait, what's the difference between when you've detected something and when you've identified what it is?
For some period of time the backups were written, and the tests that said that the backups were valid passed, right,
but they actually weren't, because it turns out the tests are also written in software. But check this out. What's great about this story is that there was a seemingly unrelated Chef change, where this sort of yellow stops here, that actually accidentally fixed it, right? So now, first of all, as a former CTO, production database backups being corrupted is like bolt upright in bed, cold sweats. That's serious business, right? What do we do when detection is actually after the resolution time?
Because if we go with the model we had, we're going to have a negative number. And I don't know if I want to have to explain that to the folks upstairs. You know what I'm saying?
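To make that arithmetic concrete, here is a minimal sketch, not taken from the talk, of how a detect-then-resolve model would tabulate an incident like the backup story. The function name `incident_durations` and all three timestamps are hypothetical; the only point is that when the accidental fix lands before anyone notices the problem, the "time to resolve" column comes out negative.

```python
from datetime import datetime, timedelta


def incident_durations(started, detected, resolved):
    """Tabulate an incident the way a detect -> diagnose -> repair model would.

    Hypothetical helper for illustration: it assumes detection always
    precedes resolution, which is exactly the assumption that breaks.
    """
    return {
        "time_to_detect": detected - started,
        "time_to_resolve": resolved - detected,
    }


# Hypothetical timestamps: the unrelated change fixed the backups on the 8th,
# but nobody noticed they had ever been broken until the 10th.
started = datetime(2019, 1, 1, 2, 0)
resolved = datetime(2019, 1, 8, 14, 0)
detected = datetime(2019, 1, 10, 9, 0)

durations = incident_durations(started, detected, resolved)

# The model's "time to resolve" is now a negative duration.
print(durations["time_to_resolve"] < timedelta(0))  # True
```

The code runs without complaint, which is the point: the tabulation cannot tell you that the model's ordering assumption was violated.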
So looking deeper at these incidents, we find a number of different things. One is that detection is different from identification or recognition. In fact, this is supported by research. These are distinct cognitive processes. Sometimes they occur right at the same time, to the extent that they're almost indistinguishable. Other times they're separated in time, of course. But if our model doesn't account for that, then we're sort of out of luck. Another thing is just that diagnosis is not always the most challenging aspect of incidents. We like to think of incidents, because it's more fun and more palpable, as mysteries, right? As a mystery to solve.
But it's absolutely not the case that that's always the most challenging part. Sometimes you know what's happening. Sometimes you know what to do about it and the issue is actually how you're going to go about it.
So here's some data from another actual incident, right? And so you can see here there was a period in the beginning, and there's a period in the middle, and there's a period at the end. You would be forgiven for thinking there was like a detection and diagnosis and resolution, or that sort of thing. And so this data, in and of itself, doesn't really tell you much. It certainly tells you how it will be tabulated, how it will be accounted for. It doesn't really tell us much about the incident. So what if we were to look at what else?
What if we were to look at the people involved in the incident? We might be able to ask better questions like when people joined and who joined.
So now I can see that there are nine people that helped out in this incident, but who are these people? So one thing we've looked at, and it's something that we worked out in the research, is tenure. It doesn't always turn out to be a strong signal, but it can be an interesting signal. Tenure at a company can be an indication of where maybe pockets of esoteric
knowledge lie. It's because these people have some experience with the idiosyncrasies, this esoteric behavior. Do these people have different roles? What teams are they on? If we consider that different roles, different domain expertise, and different teams may generate different hypotheses about what's happening in an incident, and might have varying levels of confidence about what to do about it, then we might be able to learn something that some roles know, that some teams know, that others don't. So let's come back to this. In this particular incident, this engineer called this other engineer. Right now, this is just behavioral data. They called them. Now, if we investigate this, we might want to know: what expertise does this engineer know that this other engineer has, that other people don't know about, and how did they know that it was needed?
These two engineers followed a hypothesis that this was related to a load balancer, and they believed that for very good reason. It turned out to be a pretty unproductive thread. They worked that out. But in the meantime, that meant that their attention was not available for other things.
In this other case, and this is why I love this case so much, this particular engineer started getting alerts, and it turned
out that actually those alerts had nothing to do with the original incident, but she had to work out whether that was the case. It turns out that it was a new alert. Those sort of criteria hadn't really been tuned. It turned out that it wasn't related, but it still meant that she had to give attention and sort of focus to that. So at one point, the DBA starts to feel pretty pessimistic. The incident had gone on, and they're sort of pessimistic about what is to come, and they start prepping some data repair, writing some scripts, making sure that some backups were in place. They did not end up needing it, which is great, but they were otherwise engaged there.
Interestingly enough, the engineer with the most familiarity with this particular part of the stack was at a conference, and the wifi at the conference made it such that effectively they weren't able to even sort of engage very much. They were able to get some sort of staccato-like guidance into the dialogue. The most productive line of activity, the one that produced both the actions that helped restore the service as well as prevented it from getting worse while the fix was
in play, did not come from some magical Eureka moment of one engineer. It was the result of three engineers: two on the same team, and another engineer.
It was about building and refining a hypothesis by sharing some observations amongst each other. I'll zoom in on this a little bit here. So, sort of taken from the STELLA report: when an incident happens and an organization or a team needs to bring in additional resources, additional attention, additional expertise, right? The people who are bringing others in face a trade-off,
face sort of a kind of a quandary, right? And this trade-off is between the effort needed to bring people up to speed
about what's happening and what needs to be done, or at least what the current directions are, and spending time on the diagnosis. This is a fundamental trade-off. No incident response framework can rid you of this trade-off [inaudible].
If you spend a really long time being thorough on bringing somebody up to speed, you won't make a lot of progress, especially as people come in. So you're expecting them to get context that's missing from your getting them up to speed from, say, chat scrollback, or seeing what other things are happening. But this is a fundamental bet that we have to contend with. So here's my question. Where do we put this data? Who has the skills to do this effectively? This is looking at incidents very differently than, I think, the way we're used to.
I just don't think that it fits in what we generally view as a post-mortem template these days. So some questions here, questions about post-mortem templates. Are the templates that you use used as guidance for analyzing an incident, or are they
actually the activity? Right. If I go into your post-mortem meetings, will I observe something that looks like event analysis, or does it look like a
visit to H&R Block? Right. It's very expensive, if it turns out to be that. Unfortunately, there are organizations where this is effectively a form-filling exercise. It's the most expensive form-filling exercise I think I've ever come across.
Here's a question. What can someone new to your organization, a new hire, learn if you hand them your post-mortem document? You say, I want you to read this. I want you to try to understand this event from as many perspectives as possible. When they're done reading it, will they have good questions, or will they have basic questions?
Because remember, if you weren't there in the response or weren't there at the post-mortem, then you're at a loss, and these are what help make up for that loss. Then really the question is, are your post-mortem documents written to be read or written to be filed? Remember, not all the ways we describe incidents or investigate them look
deeply, because they don't need to, because they have different purposes and different audiences. Looking closely should be a requirement, however, to get at some of that stuff in real incidents. Those messy details, those are the most important ones. So fine, at this point, imagine a world where you say, all right John, I get it. You've convinced me, right? But what are you going to do? So there's good news and some bad news with what to do. The bad news is that there's absolutely no scenario where I give you simple, easy-to-follow prescriptive guidance for what is effectively a complex challenge. Period. We already have that. I'm not going to be too down on post-mortem templates, but I've certainly seen some that are worse than others.
Here's a couple of things to ponder. Dare to think outside your template. What doesn't fit in that field? What fields don't you have? Start writing down the questions asked in postmortems and start critiquing the questions. Forget about the answers. The questions: what you explore, and therefore what you learn, depends on the questions you ask. First and foremost, better learning means better questions, so you have to treat them as being as important as the descriptions they elicit.
Ask people what they were really concerned with or afraid of during the incident, right?
Even if what they were worried about did not come to pass, especially if what they were really afraid of or worried and anxious about did not come to pass. That is data. That's critical data and it needs to be captured, because one day it will turn out that way. Ask people to draw diagrams. Easiest one. If there is literally anything you take away from this, it's: hand someone a marker, right? Experts are not very good at describing what makes them an expert, but asking them to draw things can reveal more than just their verbal descriptions. What made this incident not nearly as bad as it could have been? All incidents can be worse. We know this.
Go find out what people did to keep them from being worse and capture it. So consider listening to what an incident has to teach you,
right? Remember the incidents don't care about the fields in your post-mortem template. They don't care that you only have an hour in that meeting. They don't care what your, what your quarter has looked like. They are there to teach you something. It's your job to figure out what that is. So on the topic of people's multiple perspectives, right? Think about these as research questions that you can use going into an analysis.
How does a new person in a team learn these weird behaviors that are not
documented? Because the people in your organization who know a lot, all of these ins and outs, all of the nooks and crannies, they know stuff that's not written down, even in your perfectly up-to-date wiki. What do veteran engineers know about their systems that others don't? What esoteric knowledge do they have? And more importantly, how did they get it? How do people on different teams and/or with different expertise explain things to each other? Watching a network engineer
explain what's happening to a DBA is huge data. It helps you find the overlapping knowledge about what they think is expected, what's familiar, what's important to pay attention to in an incident. On the topic of tools and tricks, how do people improvise new tools to help them understand what is happening? All incidents are surprises. Full stop. Surprises mean that there is sometimes very little, but sometimes a great amount of improvisation that's happening. It might not be in tools, but it's certainly in demonstrations of expertise.
What tricks do people or teams use to understand how otherwise opaque third-party services are behaving? We live in a world where cloud services are built on other cloud services and integrate with other cloud services, and that boundary between, well, let me just say this: Hey Steve, is that us or is this AWS? The answer to that question is really hard, right?
There's no crisp, there's no command line where you say, is it me or is it the cloud? And yes, right? There's no exit code for that, that's binary. Are there any sources of data about the systems, logs, graphs, all of that sort of thing, that people regularly dismiss or are suspicious of? They exist. Not everybody in your organization has this. People have more confidence in this tool's ability to tell you a story versus this tool. Quite often you'll hear from multiple people that observability tools have as their basic capacity the ability to ask questions.
I think that this is excellent, but I'm interested in where those questions come from.
Finally, do people believe they have the authority to halt functionality or otherwise
take a drastic action? Right?
If somebody, in the middle of the night, believes that the absolute safest thing for your business is to shut it all down, can they do that, knowing that there might possibly be some organizational
political blowback? A cautionary tale: the Knight Capital event in 2012. 20 minutes,
$440 million. This team of engineers was really raked over the coals for not having shut it down sooner. 2015, the New York Stock Exchange started to see some issues with symbols and they shut it
all down. Everybody was pissed that they shut it down. So I believe that looking deeper at incidents in the ways that I touched on is necessary, right? We have to level up. It's time that we level up as an industry. We have to look deeper. I want you to think about how much easier it is for us to talk in incident analysis documents or in
post-mortem meetings about the software, and how difficult it is to talk about us.
We are the most important aspect. You all are the most important aspect. The reason why, in the beautiful AI future, we'll still have people on call is because we're really good. Your site or your service is working all the time because of you,
and we're ignoring it. We're ignoring how we do that really well. I think that will be worth it. Our current approaches don't highlight this and I think we need to.
I'm going to leave you with a quote from my friend Todd Conklin. He spent his entire career in safety at Los Alamos National Labs. And he says it this way. He says, the problems begin when the pressure to fix outweighs the pressure to learn.
If you walk into an analysis or a postmortem meeting thinking that your role is
to generate remediation items, you're going to generate remediation items. There's not one client in consulting or in the research consortium that doesn't say we just have a hard time following up on these things. And after looking at all kinds of different ways that organizations learn from
incidents and handle incidents, I can say that there's at least one cogent observation that we've come to, which is: the remediation items aren't getting done because they're not really all that good. And it's because they're being created out of a sense of having to do something. If instead you go in thinking there's a whole bunch that I don't know, or I'm open to understanding stuff that I thought I knew but was incorrect about.
If you go in with the idea that you're going to learn something that you didn't know, even in this run-of-the-mill, garden-variety, the-database-query-took-too-long type of outage, and you focus on what learning is all about, then maybe you've got a great shot at coming up with some remediation items that'll get done before the end of the day. If you don't, you're just going to have a whole bunch of stuff that you're going to have to do. So that's my thought-provoking bit. I'm going to be around for the entire conference, so I'd like for you to come up and ask me questions,
tell me I'm full of it, or anything along that line. Thank you for listening to me.