Note: The transcript for this talk has been edited for clarity.
I want to do something I’ve been looking forward to for a very long time. All of the clients that we’ve worked with have signed NDAs. Indeed was kind enough to support giving this talk. I’m going to talk for a bit, and then I’m going to have Jason Koppe come up and also talk a little bit. And then I’m going to say something clever and wise and thought-provoking at the end.
That’s the plan, anyway.
Before I get started, I need to point something out that’s maybe a little bit meta. It should be clear to a lot of you that organizations that are software-reliant, meaning critically dependent on software, have been thinking differently, and it’s been happening for a while. We’re still at the beginning, but the topics we’ve talked about – from speakers, with all of you in the hallways, over meals – these are not topics that showed up in the 1980s. The fact that this conference exists, and RE:Deploy before it, is even more evidence that change is afoot.
It’s really easy to lose sight of this, because a conference is just a conference – somebody’s got some slide transitions, Courtney’s using Prezi, there’s a booth and meetups and that sort of thing. But it’s the substance of what we’re talking about that’s important.
Let me give you some backstory – this is the “why should you pay attention to what I say?” part. What makes Adaptive Capacity Labs credible? This talk specifically covers our work with Indeed and our observations as an independent party working with the organization – and it exists mostly because they were cool with it. We’ve worked with some amazing incident analysts – some of them are even in the room, though I can’t tell you who they are. We’ve seen so much that it’s very difficult to describe.
After I left Etsy, Dave (Woods) and Richard (Cook) called me up and we founded this company. Since then we’ve worked with companies ranging from 150 people to 490,000 people, of all shapes and sizes: B2C, B2B, infrastructure, non-infrastructure, healthcare, all sorts of enterprise and internet-native organizations, and everything in between.
The important bit here is that we don’t do theory.
You’re not going to get theory from us. You’re not going to get theory from me today.
We are paid to do work that’s practical and hands-on. We know the theory – happy to talk with you about it outside of client projects. We’ll talk about it all day. But we don’t say “Here’s interviewing for incident analysis” and then spend a week talking about the history of cognitive task analysis and where the Critical Decision Method comes in and other knowledge elicitation approaches. We work in the messy details, and there’s not a lot of extra room for lecture.
Not all of the reactions we’ve gotten from clients have been as enthusiastically positive as what we’ve seen with Indeed.
So here’s the “what” of this talk: this is an exemplar case of meaningful and concrete progress made in learning effectively from incidents.
By the way, that’s what this conference is about. This is not an “incident analysis” conference, as you may have noticed. That’s an important distinction. Learning from incidents is the reason you do this. This is a story about what they were able to accomplish and how they were able to do it.
Bent Flyvbjerg, in an excellent paper called “Five Misunderstandings About Case Study Research,” referenced Thomas Kuhn (who came up with “paradigm shift” and “scientific revolutions”). Kuhn said this: “A scientific discipline without a large number of thoroughly executed case studies is a discipline without systematic production of exemplars. And a discipline without exemplars is an ineffective one.”
This was easy for me to understand because Paul and I (that’s us in the photo) gave this talk at Velocity in 2009. Here are some of the messy notes from Paul’s living room table a couple days before when we were putting it together. The notes say: “This is what works for us. It may not work for you.”
We don’t have enough time to go into exactly how Indeed did this – for that you’d need much longer. Exemplars aren’t powerful because they’re procedures to follow – their power is different. Some parts might work for you.
Here are some hopefully obvious statements:
- An absence of incidents is not evidence that learning is happening. Just because you’re not having incidents doesn’t mean you’re learning.
- The presence of incidents is not evidence that learning is not happening.
When it comes to incident analysis and learning, learning means… well, learning. As Dave Woods pointed out in his talk yesterday, there are all kinds of different frames of learning. But a few things that learning anything requires are exposure to the thing, contextualizing or recontextualizing it, and reconceptualizing and reintegrating it throughout this messy process.
When it comes to learning from incidents via incident analysis, there are a couple of things that are different and often not talked about, as Pirmin pointed out in the RE:Deploy panel earlier. First, people learn different things from incidents. There is no canonical set of lessons – that is not a thing. People will learn different things from incidents primarily because they learn in different ways, at different times, asynchronously. They will often learn things completely unrelated to the incident or what others learn.
We cannot control what is learned – as incident analysts, as leaders in the company, as peers, we cannot control what is learned. We can create conditions and situations where people have opportunities to learn something new. There’s no guarantee, but they’ve got a shot at learning something new or better understanding something they misunderstood before. The goal of incident analysis, therefore, is to build the richest understanding of an event for the broadest possible audience.
The analysts are the least important part of that audience. Let’s say you’ve got an incident and there are some sort of post-incident activities. You’ll have a meeting or fill out a template, or you might interview some people. We ask this question all the time: who is learning as a result of these post-incident activities? It’s a very difficult question to answer.
Do people who responded to the event learn something different as a result of these post-incident activities? They were there. Do they just go to the meeting to tell exactly what happened – or, because they were there, do they feel they don’t need to attend at all? What about people who analyze the case? The answer is yes – by the time they’re halfway done with the analysis, they actually know more about the case than any of the responders do. Why? Because they’ve already synthesized what a number of responders have told them.
What about people who didn’t respond to or analyze the incident? We have people – and people are the only ones who can learn. Let’s break this down:
We’ve got analysts, and of those:
- Analysts who worked on the case
- Analysts who didn’t work on the case but have the skills to do the work
And we have non-analysts. (What a selfish view of the world! “On Earth, we have analysts… and non-analysts.”)
Of those, there are people who were involved in the event – maybe people who responded to it, managers of responding teams who somehow participated. Others like customer service, for example, come to mind. And there were people who were not involved in the event. What about people who were on the teams of people who responded, but were on vacation or were asleep, or for whatever reason, didn’t have anything to do with responding to the incident? Or managers of non-responding teams? What about people who didn’t respond to it, who work at the same company, maybe in a completely different team, but who were responsible for some pretty similar tech?
Over there, they had some issue with Kafka or Mongo or mesh or whatever. We use that stuff. But they might think it’s probably not interesting.
The people who are responding – I think this is a gimme. Certainly there are things you could do to make it hard for both analysts and responders to learn, but let’s just grant that they do learn. What about these people? As I said a couple of minutes ago, the goal of incident analysis is to build the richest understanding of an event for the broadest possible audience. I used red for emphasis there – “broadest possible audience” being the important part.
We can’t control what’s learned. People learn things all the time. They’re probably going to learn things, and you as an analyst may have no idea. Life’s not fair. Learning from incidents is situated – very much in line with Lave and Wenger’s communities of practice, which Pirmin pointed out – and it can’t be tested for the way classroom education can. What people actually do, their behavior over time around post-incident artifacts, are the critical signals that learning is happening. People can’t learn anything from a phenomenon if they have no form of exposure to it.
Raise your hand if you want to mention your favorite part about that incident case I worked on two weeks ago. Anybody? No. Of course not – because nobody here has read, seen, or heard anything about what I’m talking about. Basic exposure is what a huge portion of this talk is about, and it’s what progress looks like. That’s the stage we’re at. It doesn’t have to be more complicated. Information is not a scarce resource – attention is. That’s Woods, Patterson, and Roth paraphrasing Herbert Simon.
How many here have voluntarily subscribed to a newsletter you’re not interested in? Has anybody here read an article you expected to be a waste of time?
Audience Member: “Absolutely.”
John: “Why?”
Audience Member: “You can learn a lot from a bad book.”
John: Yeah. If you don’t have an expectation that you’re going to get something out of it, you won’t engage – your expectation, and shifting that expectation, is what’s important. We’re going to get into that.
Here’s a team of engineers. This is fictitious – this isn’t Indeed. We have five people on the team and one manager. Let’s say something happens – an incident with the stuff they’re responsible for. Three of the five people respond to the incident: one was on call, and two others were just around and ready to jump in. Of the remaining two, one was asleep and the other was on vacation.
So let’s zoom out and get a better picture of this fictional scenario – though it’s not a joke whatsoever.
We see that they’re one of many teams. Let’s flesh this out a little bit. Let’s say that there are three incident analysts analyzing the case.
By the way, as an aside, strong guidance has been and will continue to be: the people analyzing an incident should not include responders to that incident. Also, analysis should be done by more than one person – ideally three, but more than one. Indeed does this – “what would Indeed do?”
We’re very much interested in what sort of attention, interest, participation, engagement of any kind happens with people who are not analysts of the case and not responders of the case. We call these non-analysts and non-responders. Richard was in charge of naming. I’m happy to take patches, but I can’t come up with anything better than “NANRs” (Non-Analyst/Non-Responders).
When people who were not part of the incident or its analysis go to a group review meeting, we would say that’s pretty great. This is a step in the right direction – toward the broadest possible audience.
Let’s say more people read the case writeup. Let’s say some of the people who read the case writeup weren’t even at the company when the incident happened. Let’s take it one further: what if, in a code comment explaining what you’re doing, you provide a link to the writeup of an incident that happened two years before you worked there? That’s what the broadest possible audience looks like.
I’m showing you attendance at a group review meeting – this is quite typical at Indeed. The circles here represent the number of folks who attended and roughly what part of the organization they came from. There are 28 people, mostly responders – responders and analysts. That’s a big room. Let me show you what the NANRs look like.
So these are people who went out of their way. There was nothing mandatory about going to this meeting. In your organization, it might not be officially mandatory, but you might gain social credibility by being seen there. You can’t really fake genuine enthusiasm and participation. This is amazing. This is not common.
Raise your hand if you have people from marketing coming to your group review meetings. For those of you who have your hands up, figure out what’s bringing them there.
This is five months of data. An internal newsletter is a very big deal. Something even more amazing is where the subscriptions to their internal newsletter are coming from.
So we’ve got these three Engineering orgs, but there are also people from Customer Service, Finance, Sales, Marketing, Legal, and HR. Is it everyone in those organizations? No, but I can tell you it’s probably more than yours. More than many.
What does this mean? This means a couple things:
- The newsletter’s written for a much broader audience than just software engineers. That takes effort.
- People outside of Engineering find it compelling enough to subscribe at work.
The table stakes question is: are people reading the writeups? This is from January, and I’ve split out the analysts. Let’s be hard on ourselves and just not count them, because they’re probably reading older cases and writeups to refresh their memory when they see a familiar theme. Also, they’re writing them – although these are just views, not edits.
The orange represents NANRs.
They do surveys with questions like these after people attend a group review meeting: “I felt this event review meeting was worth the time I spent attending.” It’s a survey – you don’t have to fill it out, but people do. The fact that people fill it out is amazing. The fact that people fill it out with these answers is amazing. “I am likely to read the incident writeup after attending this event review.” This is a very big deal.
We’ve worked with them handling tricky, tough situations and really complicated mechanisms in some of the cases they’ve analyzed. Richard and I have asked them: how are you able to do this? How is Indeed able to do this?
They committed to building the skills necessary. They knew that this was a skill. They knew that this was stuff they didn’t know how to do. This is a thing they were going to get better at. Something that Richard pointed out, which I fully agree with: You do not learn how to ice skate by reading a book about ice skating. You ice skate.
This is a graphic that I had to pare down so it didn’t distract everybody while I’m talking. There’s a much more fleshed-out version of what these skills, what this expertise, looks like. But practically speaking, it’s interviewing for incident analysis – something engineers are instinctively terrible at. The worst of all, and that’s not a joke. The instincts we have as engineers push us to do many things that are absolutely counter to interviewing for incident analysis. And no, you can’t read a paragraph and know how to do it. They’re very good because they review and critique their own interview recordings when they can record them – about as close to watching Gary Klein at his critical decision making workshop as I’ve seen.
They demonstrated what different looks like – at least enough to garner attention. Indeed’s been around for a long time; it’s not like they had nothing. Many of the companies that you represent and work at have had times when you tried a new template or some new thing. You tried different things. They needed to demonstrate what “different” looked like, because otherwise it wouldn’t have shifted expectations.
People come to the LFI team at Indeed requesting, “Can you do this case for us?” They have to turn some down. They’re in demand because they’re good.
Quick aside: what’s typical is a common cycle of decreasing value. People don’t read writeups because they’re terrible. And why are they terrible? Because people don’t put a lot of effort into them. And people don’t put a lot of effort into them because they don’t think anybody’s going to get anything out of them.
But here’s the kicker. This is from the Adaptive Capacity Labs #general channel – I still believe to this day that the LFI team at Indeed made this cycle go the other way. And they don’t stop there. They kept building momentum and adapted as they went, coming up with ideas that Richard and I had never even thought of. Most importantly, they anticipate new barriers that might arise and threaten that momentum.
Good analysis is not about capturing what you, the analyst, find interesting or important. Zero. This is not about you. It’s not for you. You are not the one who has the answers. It’s about discovering, exploring, and representing what others found to be interesting, important, unclear. You make the writeup compelling and people want to read it because it’s the story that they have already told you. This is what the LFI team at Indeed excels at. Jason, I’ll hand it over to you.
Jason: Hey everybody, I’m Jason Koppe. I lead SRE at Indeed. I’ve been at Indeed for a little over 14 years. When John and I were talking about the conditions for the LFI team at Indeed and what has allowed us to accomplish all of this…
The thing that came to mind was initiative and patient long-term progress. Pirmin mentioned earlier that it takes a long time. Dave mentioned yesterday that it’s hard work. Back in 2019, Alex Elman, who spoke yesterday with Sarah, started talking internally about Resilience Engineering topics, the Swiss cheese model, and presented them to the SRE organization. This began to seed the idea that maybe something could be different. Alex also presented publicly that year about that topic.
At that time, we think we had just one LFI advocate. Alex was thinking, maybe we should have a team of people that did this LFI work. But it wasn’t until 2020 that we got a second LFI advocate, and that year Alex and I started talking with John and Richard about engaging with Adaptive Capacity Labs.
Alex learned about Jeli before Jeli was announced and was able to get into a design partnership with them. This was actually a little bit confusing to me. I knew of Nora being the chaos person, so why was she doing incident analysis? Chaos was the thing she was doing. And then John was known for DevOps – why was he doing incident analysis? I was actually just confused. I wasn’t yet an advocate at this point.
We proceeded to go into the design partnership with Jeli. Alex, who I was managing at the time, said, “Hey, I’m going to be on PTO next week. Can you work with Nora and Laura to help them launch this design partnership?” I said okay. I met with Laura and Nora and I just couldn’t get this out of my head – why incident analysis?
So I started listening to some talks. I listened to Dr. Laura McGuire’s talk from RE:Deploy. I listened to Ronnie Chen’s talk on deep underwater team diving from RE:Deploy. After listening to those talks, I started to have an epiphany. I started to realize that learning from incidents was actually much more than just the incidents themselves.
I’m somebody who does rock climbing. When Laura was talking about the mission of getting to the summit – with variables changing, needing to change the plan, and working with a team to do it – it all started to click for me.
Because in my experience, one of the things I did to learn about rock climbing was reading the accident report from the American Alpine Club every year. And I listen to the Sharp End podcast, which brings on survivors of climbing accidents to talk about what led up to the situation, how things were going, and what they learned from it.
For me, in a very unplanned way, I realized that there was something bigger than the incidents themselves. At that point we transitioned into having two advocates. We worked on a budget proposal to create a dedicated LFI team and get the Jeli platform.
We didn’t actually get all of the headcount that we originally requested. We requested three full-time analysts, a manager, and a program manager. We got one, but we’ll take it. We also got clearance to have ACL come on board and do a training a couple months later.
There was another perspective shift for me during the ACL training. I did the training with the rest of the team, and I’m the head of SRE. I have about 90 people in my org. My job is to talk to people all day, to meet with them one-on-one, to understand what’s going on for them and help them work through it. I did this training and I was completely exhausted at the end of the week. I was so stunned at how hard this different kind of work was that it crystallized for me that we really needed this to be a full-time position, and we really needed a big team.
In 2021, we were able to hire our first external incident analyst and also have a couple of internal transfers. Last year we were able to grow the team to seven dedicated full-time incident analysts – some of whom are in Tokyo, where we have a presence.
When thinking about how we were able to do this, what you’re about to see is a collection of our different perspectives about what we think might be some of the conditions for the growth of LFI at Indeed.
Our corporate values from the very beginning – number one has been what’s best for the job seeker. This permeates a lot of our decisions when it comes to teamwork, working across boundaries, and prioritizing decisions.
Our leadership has created a very supportive environment where people can grow, have space to innovate, and can safely bring their whole selves to work – for themselves as well as for others – through belonging and psychological safety.
For a long time we’ve been a bottom-up company. We even had initiative as one of the dimensions of performance management for a really long time. There’s a lot of practice of autonomy, with different parts of the business doing what they think is best for the job seeker.
We introduced LFI a couple of years ago. Our product is mainly stable – we’re a job search engine. Employers can come to the website and post jobs, people search for jobs, we match them together. We’re beyond the startup phase; we have enough employees in the company to justify the full-time LFI team. Our Engineering organization is about 2,000 people, and our LFI team is seven of those.
One of the things that came out is that our employees are happy to engage in the LFI process. Somebody mentioned earlier today that engineers are happy to tell you their story. Just ask them a question and they’re happy to tell their story. This might be in contrast to some other organizations where people withhold information. I think our leadership has done a really good job of trying to create an open and transparent culture that fosters this.
Because we’ve been growing and changing constantly, there are sometimes unclear processes and boundaries. We don’t have a single software development process at Indeed; teams have their own variations. We don’t even have a single incident process at Indeed – groups have the autonomy to vary. There are some teams at Indeed that calculate MTTR and use it to drive their decisions, but not everybody needs to do that.
Because people change teams a lot, expertise is spread across different teams; it sometimes gets pulled back into incidents, and that can help spread knowledge. This concept of helping others is actually built into our performance management system as well, which I think helps accelerate a sense of reciprocity. That’s it for me. Thank you.
John: I know I’ve been snarky earlier – it’s because I’m excited. I really didn’t think that what’s happening would ever happen. To be fair, Paul Hammond and I didn’t think it would happen before, either, and then Continuous Deployment became a thing. It feels much more – many of you have heard me say this – it feels very much like this shift is a paradigm shift, not just a shift in perspective. I’m not going to soft-pedal it anymore. This is a paradigm shift within the software world. It’s always uneven, and it comes in waves.
But it’s happening. It reminds me very much of the 2008-2011 timeframe. Your kids will see pictures of this conference, and their kids will see pictures of this. You laugh, but my kids have seen pictures of the early meetings in the 80s with Dave and Richard.
So I’m going to end with this: we now have an exemplar.
What if you had aspirations? Do you need the exact same conditions? No, obviously not. But this is evidence, just like all qualitative data – evidence that it has been done. And sometimes that’s all you need.
I won’t belabor the point about whether you need to get support from leadership or the flip side, which is “the man is keeping us down.” I obviously subscribe to Richard’s point of view and Dave Woods’s point of view – all in good time.
What if you did get all the leadership support you wanted? What would you actually do?
I’m hoping this talk will help at least provoke some thoughts and generate discussion. Indeed has done the work to show what different looks like. They didn’t just shift expectations inside the organization. I’m hoping this talk will do it outside as well. Thank you.