Adaptive Capacity Labs

Incidents: What Is Often Missed & What Can Be Done About That

I was invited to give a talk at the Spotify office in New York last month on the topic of learning from incidents.

Here is the first slide, which I hope to be the BLUF (“bottom line, up front”) of the talk…

The Main Gist: 1) There is a shift happening towards understanding incidents in deeper and broader ways. 2) Incidents can yield much more value than has been recognized thus far. 3) Learning is not the same as fixing. 4) Doing this well takes effort and practice, but will prove to be a competitive advantage.

Here is the video recording of the talk, with an interactive transcription…

Thanks for coming. I'm hopefully going to provide some entertainment in this talk. Certainly the goal that I have is to introduce some different ways of thinking about some topics that I think you all are pretty familiar with. The other thing I was going to say is that I have slides that I'm going to talk to, and there certainly will be questions after, but there might be a bunch of questions I ask you all during the talk.

So here's the bottom line, up front, to give you a sense of what you're in for. The first thing that I want to get across is that there is a shift, something in the air, in the zeitgeist of the tech industry, towards understanding incidents in deeper ways. I'd say that there's a nascent but accelerating community of people who are looking to borrow techniques, methods, and approaches from other domains and apply them here in software. So I'll talk a little bit about that.

The reason for this growth of interest, curiosity, and enthusiasm is that incidents can yield much more value than has been recognized thus far. The conventional, typical techniques of learning from failure, learning from incidents, postmortems and all of that, can go much further. We're leaving a whole bunch of valuable things on the table.

A lot of this sits firmly on the idea that learning is different than fixing. Quite often you hear a lot of cliches: "never let a good crisis go to waste," or of course we want to "learn from failure" and take all the lessons we can from our mistakes. But quite often we'll say that, and we'll say "learning," but learning is a pretty complex cognitive phenomenon. There's perception and then there's memory. And so usually what we really mean, typically, conventionally, is not really learning. It's more like fixing. So we'll talk about that.

I'm going to end with this idea that doing this really well takes some time. I've been studying this. I'm a recovering CTO. I used to work at a company called Etsy, which is across the river. Much of doing incident analysis involves doing some things that you're not used to, especially as software engineers, and taking some perspectives that we're not used to in the industry. It takes practice and coaching and all that sort of thing. We'll talk about some of the pretty straightforward things that you can do to help build on that.
But when you do this really well, and I've seen it in a small number of organizations who are taking this approach and putting this effort in, it turns out to be much more valuable than just "it never happened again" or "we're having fewer incidents," all that sort of thing. So that's basically the map of the talk here.

I am a bit excited because this is Spotify. When I was CTO and working at Etsy, I became interested in some of these other domains that I'll talk about, and I managed to find a master's program that I could do while still living in Brooklyn, and it was in Sweden. I'm hoping to score major bonus points with you all. Now, Lund is in southern Sweden, which I understand is not really Sweden for people up in Stockholm. I've been told that it's sort of like the Nevada of, or the Denmark of, Sweden.

So let's get back to this, right? There's a shift taking place. Oh, and if you have trouble sleeping, this is the title of my master's thesis.

So there's this shift that I'm talking about. There's a fast-growing interest and enthusiasm for fields such as Human Factors, Cognitive Systems Engineering, Safety Science, and Resilience Engineering. These fields are intertwined in various ways and, except for Human Factors, emerged largely in the last 20 or 30 years. There's a lot of interest from software engineers in this, and that's a lot of where this interest in broader and deeper incident analysis comes from.
So how can you tell that people are paying attention to this? Well, we're in software engineering, which means that we can't shut up about what we're enthusiastic about. So what you'll find are lots of conferences and conference talks on various topics, even in Australia. I will say that I was lucky enough to have tricked, I mean convinced, ACM to do a special issue on the role of human performance in software. ACM Queue is, I think, free on the web, and there's a collection of five different articles that treat with scientific rigor this study of what makes human performance difficult, and what we are actually doing every day that prevents incidents from happening. Because you are.

You know it's definitely real when there's an entire conference centered around these topics. The ReDeploy conference has been going for two years now, and I think they're going to do it again. I mentioned Lund, the program that I was in. There are not only graduates; there are multiple people from software, from multiple different organizations, who are also enrolled. You know that this is a serious thing when somebody makes that commitment. If, you know, bananas John Allspaw goes off and does it, that's just a one-off, right? But when there are more people in the industry interested in making this commitment, you know that it's real.

I will say that this reminds me very much of a world when continuous deployment and delivery and DevOps, and even concepts from distributed systems, were not a thing in the industry. It slowly became a thing. You had the emergence of, for example, I don't know if people are aware of them, the Papers We Love meetups and conferences, talking about concepts that had been called too academic, or that certainly take a lot more effort, because you have to read more than code to understand them.

And then finally, I would say that there's a website that Nora Jones, who's in the Lund program, graciously and eloquently started. It's a community, a cross-company interest group. What you will find is Nora Jones and Casey Rosenthal; these are two names that have in the past been associated primarily with Chaos Engineering. There's a significant overlap, and in fact this simplistic Venn diagram doesn't really capture it. I'm not going to talk about that here, though I certainly can after the talk. But there are some important parallels and contrasts there.

So, the shift in perspective. Hold on one second. Let me get a water.

So there's this notion that we can spend some time to fill out a form, or that all incidents have a similar structure that can fit into boxes, and that we can capture that. I'm going to put that out as a typical or conventional approach. These generally end up as documents that don't capture very much about what it's like to be in an incident, or handle an incident, or predict, anticipate, or respond to an incident. It tends to collapse as much as possible into the objective and largely numerical. Here's when it started, here's how long it lasted, here's the average of how long the last incidents lasted, and here's some severity. If you can put that into numbers, then it's great for spreadsheets, and you can make charts of it, and it gives you some sort of comfort or sense that you have control over the future. As a result, in an effort to become efficient, it also turns out to be largely focused on fixing.

I want to make a contrast with this broader, deeper sort of route, which is really an effort to capture the messy reality of incidents. Because they don't go from detection to diagnosis to resolution in these nice neat buckets with boundaries that are straightforward and unambiguous. This approach also tends to gather multiple perspectives, because almost always teams of people are the ones responding to incidents, and also making attempts to anticipate them, and doing projects that have presumably preventative designs in them, informed by their experience in the past.

As a result, if and when you can do this, you end up with much more rich and compelling stories. These are documents that people want to read, that people are looking forward to reading. And the idea that somebody can become familiar with, or learn something from, an incident that they didn't respond to, from an event whose post-incident group meeting they didn't even attend, means that you can get things out of engineers like, "Oh, I didn't know it worked that way," or "Wait a minute, I thought it worked like this. I wonder what will happen if this or that." Spurring new dialogue is important.

If you focus on fixing, that's one thing. If you focus on learning, better-informed fixing can come from that. That's the gist of what I'm trying to go for here. Give me a roll of duct tape and I can fix anything.

So what's the result? Why would you do this? Well, one, it means that you could imagine a write-up of an analysis that could be used as, and even acknowledged as, a rich source of confidence for engineers. If you think about all the tests we write, and all of the dark launches and the feature flags and the config flags and A/B testing and chaos experiments [inaudible], all of that is about generating, or possibly amplifying, confidence that the code is going to do what you expect it to do under the conditions you can imagine [inaudible].

What incidents are is extremely effective directors of attention. If you personify an incident, you can think of it as saying, "Hey, everybody, I think you should come look right around here, because I'm pretty sure it doesn't work the way you think it does." That way it directs your attention in a particular way, and you can get a better sense. Incidents are showing you that blind spots are there.

As a result, you can have better-informed project roadmaps. You can imagine a world where you'd say to a group of staff engineers, "All right, look, I think we should replace this piece of the stack here. Can you come up with some general ideas? We're not going to build anything. Just some general options; do some design on a whiteboard or whatever. Then let's talk about it after you all come up with one or two options. Oh, and by the way, you are prohibited from taking into account any experience you've ever had with incidents with this stack." Sounds like a ridiculous thing to say, right? We know that incidents affect our plans for the future. They definitely do.

We find knowledge islands. That is to say, especially in fast-growing companies, engineers may have esoteric, really specialized knowledge about this.
Well, there's this thing that everybody knows they know, but it's not like you can go on the wiki and say, "What does Lisa know?" and just look up what Lisa knows. The reason you can't do that is that Lisa also might not know that what she knows will be of [inaudible] important consequence in the future. But what incidents can do is show you what people do know, because everything is time-compressed and consequential, and everybody knows to call Sylvia when this thing happens. That's the runbook. The runbook is "call Sylvia," because when she shows up, it gets done, and it's just straightforward. It's sort of magic. This is what does happen.

You come to understand how different teams depend on each other, their services, their tooling. You can use incidents as training materials for new engineers, and for leaders who want to understand: Are we still using this service? Should I cancel this contract? I don't know. Should we look at an alternative? These messy details can help inform those decisions.

There's a great deal of skill and expertise that you all have as hands-on engineers that you might not know that you have. This has been shown over and over in cognitive research: experts are not necessarily expert at describing what makes them an expert. This tacit knowledge is a significant fuel for what keeps all of your systems working most, if not all, of the time.

So why this shift? Oh, that last bullet was about legacy. I've met some companies that have legacy systems, though I can imagine Spotify is probably totally different. The fact is that there's no boundary. On April 13th, something doesn't go from not legacy to legacy, right? When something is declared legacy, what the boundary is, what is labeled legacy, is pretty subjective. The question is, from which vantage points is it assessed? Deeper analysis can do that.

Okay. So, against this contrast, here is an assertion. These might be controversial.
Current and typical approaches to learning from incidents have very little to do with actual learning. And the reason why I feel strongly about this is that in most of the organizations that I've seen, including one that I have run and led myself, most post-incident review documents are written to be filed, not written to be read.

Presumably there are parts of Spotify that write post-incident documents. Is this fair? Right. Great. Awesome. Amazing. That's step one. How many people read them? You read them? How would I know that you read it?

I mean, if you think about it, counting the number of accesses on a web page, I'm no expert, but I'm pretty sure that this can be done and it's pretty simple. But you would think: why are you writing it down if people aren't reading it? And if people who do read it don't come away with a full understanding of the incident when they weren't there, then I would argue: who's learning?
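As a rough illustration of the "counting accesses" remark, here is a minimal Python sketch that tallies reads of post-incident documents from web server access logs. It assumes the logs are in Common Log Format and that the write-ups live under a hypothetical `/postmortems/` path; both are assumptions for the example, not details from the talk.

```python
import re
from collections import defaultdict

# Matches a Common Log Format line for a GET request, for example:
# 10.0.0.7 - alice [12/Mar/2019:10:15:32 +0000] "GET /postmortems/x HTTP/1.1" 200 5120
LOG_LINE = re.compile(r'^(\S+) \S+ (\S+) \[[^\]]+\] "GET (\S+) [^"]*" (\d{3}) ')


def readership(log_lines, prefix="/postmortems/"):
    """Return {path: (total_hits, distinct_readers)} for successful GETs under prefix.

    The prefix is a hypothetical location for post-incident write-ups.
    """
    hits = defaultdict(int)
    readers = defaultdict(set)
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not CLF GET requests
        host, user, path, status = match.groups()
        if status != "200" or not path.startswith(prefix):
            continue
        hits[path] += 1
        # Fall back to the client host when no authenticated user is logged.
        readers[path].add(user if user != "-" else host)
    return {path: (hits[path], len(readers[path])) for path in hits}
```

A count like this only shows that a page was opened. It says nothing about whether the reader came away understanding the incident, which is the harder question the speaker is raising.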
I do want to put a bit of a point here, and I have some credibility on this front. The notion of the blameless retrospective, and the very important, very critical concerns around psychological safety, allowing and supporting engineers to give an account of what was going on for them during the incident, before the incident, after the incident, all that sort of stuff: critically important, and also only a required condition. You can be an absolute poster company for psychological safety and learn very little. It's necessary but not sufficient. Blamelessness doesn't magically make you learn. It sets up the conditions where learning can take place. Learning rides on top of that.

So there's this difference. A lot of what has led the tech industry to take the approaches that we have is a bit messy, sort of all over the place. We've got some methods over here, we'll steal some stuff from Toyota, and, you know, "I saw this thing in an NTSB report," and all of them are sort of cobbled together. In the end, how we imagine incidents happen is what fuels how we believe we should capture how the incident took place. But there's a difference.
There's a gap between how we imagine incidents happen and how incidents actually happen, and there's a significant reason for that. The biggest barrier, probably the one that underpins almost all of it, is hindsight: this tendency to simplify a complex event right down to a single story, or in some cases a linear story, like dominoes or whatever. A singular "here is the story," the timeline constructed. That is the tendency.

There are a number of different reasons for it, and I can certainly go on and on about them. But suffice to say that there's a discomfort about surprise. All incidents are surprises, and a surprise is a contrast to our expectations. If our expectations don't play out the way we expect, then we can be fearful that those expectations will be breached in the future. Our desire for comfort, for control, for being able to predict the future is what brings hindsight. Some have said, with some pretty convincing evidence, that hindsight, what's known as hindsight bias, and what was once called the "I-knew-it-all-along" effect, is one of the most closely, intensely studied psychological phenomena in modern cognitive psychology.

The result of this is that multiple contrasting perspectives can get wiped away in favor of this one line, right? This perceived need to be efficient and crisp in the story is, if I were to use a compression analogy from audio or video, lossy, right? Smoothing out this messiness is like throwing away valuable data.

So let me give you an example. I'll give you a true and accurate statement: a high school senior in Illinois led classmates on an 11-hour crime spree, committing fraud, grand theft auto, and cyber crimes. This is a description that is absolutely accurate. There's not one part that's wrong. I have described the movie Ferris Bueller's Day Off. Raise your hand if you've seen Ferris Bueller's Day Off. Hey, that's a lot. It's better than I thought.

Now, of course, if you've seen it and you remember the movie, you can imagine this is a very true story, almost certainly from the perspective of Principal Rooney, right? This is an explanation, a description of an event. Now here's another representation of Ferris Bueller's Day Off. You could make an argument that this is really just more of a detailed timeline. It certainly has more data to it; it's significantly more data, magnitude-wise, than that last statement I gave, and yet it still contains that statement. Now, of course, you wouldn't even say that that was really sufficient, because as far as I can tell, I can't get any understanding of, you know, Cameron's experience, and Cameron had some pretty tough experiences on that day.
So what hindsight does is this. Before and during an incident, there are conditions that are set up, and there are actions taking place in parallel, all of which are contributing to a particular situation. After an incident, it can be easy, like this dude with the magnifying glass, to see only a portion. And here's how clear it looks. I'm going to try something here.
So I want you to pay attention to the screens. I'm going to give you 10 seconds, and I want you to count the number of F's, uppercase and lowercase, that appear in the following sentence.

Okay. So now I want you to raise your hand if you believe there are four F's. Raise your hand if you think there are five F's. Raise your hand if you think there were six F's. Raise your hand if you think there are seven F's. Raise your hand if you didn't raise your hand to any of those answers.

So there are six F's. Now, I walked into this room thinking, here are some people who have some pretty expert experience with the letter F, and I thought to myself, there's no way that they could possibly fail this. There's no way that they could get this wrong. And by the way, everyone gets this wrong; otherwise it wouldn't be funny or interesting as a slide. The fact of the matter is that in most Western languages, we do not read letter by letter. We read by shapes of words and a whole bunch of other aspects of how words look when they're next to each other.

Now imagine a world where I've given you this test, which as far as I can tell was ridiculously simple, and which many of you did not get correct. If I get to write the story, I could say that there was a lack of rigor, right? That you weren't paying enough attention.
Software engineers, you all work in text all day. How could you possibly not know this? This was not a failure of procedure. Despite clear and obvious instructions, which were as clear as they could possibly be, you all ended up being negligent. I can say this, right? And so what I would say as a recommendation is that we would retrain you on what the letter F is.

So, no. All of these statements are a summary of your experience written by somebody who didn't undergo the experience. If I changed some conditions, if I gave you 60 seconds, the results would be different. If I printed it all out and gave it to you and gave you 10 seconds, the results would be different. The fact is that you're at an internal conference and I've got this white background, and you can already see that if I had colored the F's differently, you would have gotten that there were six there.

So the point that I want to get at is that who gets to tell the story is significantly important as to what the fixes are, as to what the description is, and as to how you'll remember the story in the future and how you'll tell your colleagues about it. If I leave us with this, I'm not capturing a significant portion of what that experience is like.

So now, here's what can be done. You're like, "I got it, John. We're terrible. I understand. The industry's awful." So there's this conventional... Actually, first I want to ask: raise your hand if you're on call right now. Okay. If you have to just jump up and leave, I'm totally cool. I understand. I have been there. Don't worry.

All right. Now I've got two questions here. I want you to raise your hand if you remember responding to an incident and experiencing such a profound sense of confusion about what you were seeing that you simply had no plausible explanation. Like, "I have zero ideas what's happening right now."
Yeah, that's happened to me too. I also want you to raise your hand if this sounds familiar. You're responding to an incident, and you and your colleagues work out: okay, we might know what's happening, we might not know what's happening, but we're all agreed we should do this thing. We should take this action. We should run this command or hit this button. Right? I want you to think back to right when you're about to do this. The idea is that it's either going to fix it or prevent it from getting worse or whatever. I want you to zoom back in your memory and think: is there ever a situation where you remember, right before you're about to do it, thinking, "There is a 50% chance this will make everything worse"?

Okay. Okay.

What you all have told me, and the reason you're laughing, is that it's so familiar. By the way, I've tried giving these sorts of examples to a room full of air traffic controllers, and they don't laugh. But the reason why it's funny, the reason why it's so familiar, is that it's palpable. It's in your gut. This is a part of your experience. And guess what: that experience is the stuff that makes stories people will read.

Coming back to this: the conventional view of where post-incident analyses have value is a little bit like this. There's an incident. Somebody might do some amount of prep, maybe not. They might have a big meeting. Sometimes they call it a postmortem meeting or a post-incident meeting; it doesn't really matter. Then there may or may not be a report, but the tendency is to think that the most important thing is to get the action items. Because if you don't, then we must not have learned.
But it turns out that this is an extremely expensive meeting, and when it goes really poorly, you've thrown a lot of money out the window.

Can you raise your hand if you've ever heard the parable of the blind men and the elephant? Okay. I'll briefly recap it. The way the story goes, the way I tell it anyway, is that there's this monarch, this king, who is fascinated by the world outside the palace but is totally afraid of it and doesn't want to go. So he goes to his trusted advisors, who are these blind men, these monks. And he says, "Listen, I've heard about this thing, the elephant. I've heard various bits here and there. Is it real? Is an elephant real? I've never seen one. I'm just fascinated. Go out in the world and find an elephant, and come back and tell me all about it. I'm going bananas. I need to know about it."

So they go out and they find an elephant and they come back, and he says, "Well, did you find an elephant?" And they say, "Oh yes, Your Highness, we found an elephant." And he's like, "Well, what is it like?" The first monk says, "Well, I'll tell you, Your Highness. An elephant is like a really flexible hose. It moves around and it's a bit rough." And another monk says, "No, actually, an elephant is a huge cylinder. In fact, there are a couple of cylinders that go straight to the ground." And another says, "I don't know what you're talking about. It's like a big wall, a slightly bumpy, rough, very hard wall." And then the other one says it's really flappy, or there's a little sort of... So they're all describing it. Now, all of their perspectives are true and accurate, and they're valid. Most importantly, they're not wrong. They just haven't sufficiently put their stories together.
And you can do that. For example, I can't see the front of this lectern, but you all can. It would be ridiculous... I'm being dramatic. I'm going to calm myself down. It would be foolhardy for me to try to describe the front of the lectern without talking with you all.

What we find when you can do this are dialogues that look a little bit like this. You have somebody who has a picture, a mental model of how things work, and you're setting up the conditions for them to contrast their mental model with somebody else's, with more than one person's, possibly somebody who has more context about not just how it works but also the origins of how it came to be the thing that they now know. Now you've worked out this contrast. You've got this opportunity for people to understand: "I didn't know about this part. I didn't know it worked that way." That's the number one thing that we would hear.

So as a result, when you can do this and capture it really well... this person captured it as well as could be: "Now my reaction to an incident is, 'Oh boy, I can't wait to read the writeup.'" You set this expectation that a writeup will tell you some stuff that you otherwise wouldn't be able to get. My friend Lorin Hochstein at Netflix points this out.

So what is an alternative? I don't have too much time, but briefly: you spend some effort and time to do an initial analysis. There's lots of data about an event and what surrounds an event. Prior to this really expensive group meeting, you can gather it, including through one-on-one interviews: what people's perspectives are, what was difficult for them, what was surprising for them, what made it surprising, what made it difficult, what still remains a mystery to them about how something broke. Or, even more upsetting for software engineers, they might be confused about how it got fixed, which can sometimes be more unnerving.

Then, in that world, this group meeting is effectively an opportunity to present a collated picture from these multiple perspectives, placed into the context of each other. Here's a slightly controversial one: don't use that meeting to develop ideas about what to do about it. Separate that in time. Have you ever been really stuck on a bug, and the idea of how to fix it comes to you when you're in the shower or at the gym? Soak time is extremely important, especially when working through and grappling with complex events.

All the while you're constructing a narrative, and this narrative has more than just action items. Like I said, it has much more to give, and we've seen this go incredibly well. So I would say: interesting incident analysis documents get read. Compelling incident analysis documents get read and shared with others. But fascinating ones get read, shared with others, commented on, asked about, and referenced in code comments, in pull requests, in architecture diagrams, in other incident write-ups, and in new-hire onboarding. And I can guarantee you this: uninteresting documents don't get any of it.

Quite often what we think is that the problem must be with the sharing. "But how do I get people to read it?" Well, don't have it be terrible. You can have a method of distribution, but if the content is terrible, they won't come. Right? Green Lantern was a movie that was shown in thousands of theaters. People didn't go because it's terrible.
So make an effort to highlight these messy details: what was difficult, what was surprising, how people understand the origins, what mysteries still remain. Without this, the write-up doesn't provide the backdrop for the incident. It's sort of skeletal. It's that one-sentence story of Ferris Bueller's Day Off. Here's an example: "He just needed a little space." What is that? It's a joke,
but that's the punchline, right? It's funny if you have the joke: "Did you hear about the claustrophobic astronaut?" That's the punchline. The punchline isn't funny. It's only funny with the setup, right?
So how could you do this? Let's say there's a two-hour outage, and I've got a Slack transcript longer than the stage, or a recording that I have to go through. What are signals that lead me to think something might be worthwhile to dig into? What questions should I ask somebody if I'm prepping for an interview? Well, indications that people were surprised. Novelty. By the way, these are real things that people have said in chat. All right? It sounds straightforward, but these are people. Here's the thing we know about incidents: people only type things that they believe are important for them to type at the time, and they do not type things that they do not believe are important at the time. So when Steve says, "I have no idea what's going on," and Laurie says, "Well, that's terrifying," that's about as good an invitation to go talk to Laurie and Steve as you're going to get.
Understanding how people get confused.

It can also be pretty straightforward.
You just have to be able to take the stance that you're analyzing an incident. You're in the details, but you're also looking for the ripest areas to dig into. So I'm gonna leave you with some things to experiment with. Okay. Like I said, I don't work here, so feel free to dismiss any of them.
Try this.
Try separating the generation of follow-up items, or action items, or remediation items, from the group incident review meeting. And you'd then say, "Well, what's the goal of the meeting, then?" The goal of the meeting is to generate new data, to generate this contrast between different people. You're effectively getting all of the blind men in the room before they go talk to the king, and placing their stories together, describing to each other what they think happened.
Record, in the document that you're going to save, who responded to the incident and who attended the group meeting. I mean, it's pretty straightforward. You should be able to get those. If you do that, a reader six months down the road will have some understanding of who to talk to and what questions they might ask.
Capture things that were done after the incident. Sometimes an incident will happen and engineers are like, "Okay, that's totally over. All right, before I go get a drink, I'm going to do these things, because there are some things that simply can't wait. We're not going to wait for the post-mortem. I'm just going to do these things." And it's like, yes, please do those things. Capture those. It's quite easy to forget that those are things, but those are literally the most important action items, because people couldn't wait. Sometimes there might not be a ticket or any sort of evidence that it even happened.
Give your write-ups to a brand new engineer and ask them to record any and all questions they have after reading them. "Do you have any questions? I want you to read this. I want you to tell me what you think happened in the incident." And if their questions are really rudimentary, like "Ooh, what's DBSuperDuper?" because the write-up says DBSuperDuper as if everybody knows what DBSuperDuper is (well, but they don't), then use that to flesh it out: link this sort of company-specific jargon to the documents that describe it. You want to make things easy for your reader.
Ask more people to draw diagrams. I mean, I'm one of those engineers who will see a wall of text and be like, "Ugh, all right, I have to read this, or I have to do some work to figure out what's in here that's interesting to me." Having some sort of visual representation is important.
A bit more controversial is to have someone who was not involved in the event lead the analysis. There's a great blog post on Learning from Incidents in Software by Ryan Kitchens, and I would suggest that you go to it. He says it this way, which is: if the incident analyst participated in the incident, they will inevitably have a deeper understanding and bias towards the incident that will be impossible to remove in the process of analysis. People who were involved will not ask questions about things that they believe everyone knows. You're not analyzing the incident just for the people who were there. Flip it, reverse it: you're analyzing the incident for people who weren't there.
Resist focusing on reducing the number of incidents. This may or may not be for managers, but I'll again say what Ryan has said: the cliché idea that we do this work to reduce the number of incidents, or to lessen the time to remediate, is too simplistic. Of course, organizations want to have fewer incidents. However, stating this as an end goal actually hurts our organizations. It will lead to a reduction in incident count, not from actually reducing the number of incidents, but rather from lessening how and how often they are reported. What constitutes an incident is much more negotiable and fuzzy than many of us would like to admit. As a result: "Ah, maybe we don't have to call this an incident," and it won't show up on a graph that goes to the king of the world or queen of the world. Instead, if you were to flip that around and focus on increasing the number of people who want to read reports and attend post-incident reviews, then you're going to be in much, much better shape, because what they learn there will influence all of the things they do that go into not having incidents to begin with. Okay, if that makes sense. So that's all I have for slides.