Adaptive Capacity Labs

Can Resilience Engineering be sufficiently described in 5 minutes?

I had to take up the challenge of speaking about Resilience Engineering at an Ignite talk that I gave at the always excellent NYC DevOpsDays.

Here’s the abstract for this 5 minute talk:

Of course the answer to the question in the title is “no” because this twenty-year-old multidisciplinary field is as broad and deep as Distributed Systems. Bringing perspectives, methods, and concepts from Resilience Engineering is a long game; my goal is to whet your appetite and lay down enough compelling threads for you to pull on as this important long game unfolds.

Here’s the video, with the interactive transcript:

So when we think about incidents in software, we tend to think about looking at them closely as a way to understand and sort of prevent them from happening. We want fewer incidents. And that's what we're generally thinking about when we think about them, especially with postmortems. But follow me here for a moment. What if we were to look a lot more closely at what happens when incidents are not occurring and try to maximize or increase that? So this is a key concept from a field called resilience engineering. My entire goal of this talk is to get at least a couple of you to want to come talk to me later about it. That's it.

So the first thing I want you to understand about resilience engineering is that it's a field: a 20-year-old multidisciplinary field that has a number of different influences, a number of different angles to it from all sorts of different scientific domains. The second is that it's a community. A number of the domains that you see here all have sort of paths into resilience engineering, and they all have high-consequence, high-tempo work environments. Now, you'll note that I've got this sort of "new" indicator on software engineering because it's important, I think, for us all to understand that the software world is brand new to resilience engineering, only in the last couple of years.

So, to dispel some notions: resilience is really nothing about computers or software, even though we say that -- it's not about redundancy or robustness or high availability or fault tolerance. It's not even about Chaos. It's not Chaos Engineering -- although it's related -- it's activities to sustain adaptation. We have to adapt, and we're constantly adapting whether we notice it or not, in the people and the teams that are doing this work. Resilience plays out in the events that are unforeseen, unanticipated, fundamentally surprising. And this is something I want to really sort of drive home for you. What does it mean to be fundamentally surprising?
Well, so if you imagine a lottery ticket, many of us who might buy a lottery ticket are not gonna win. But a situational surprise is where you buy a ticket and you win the lottery -- it's within the realm that you could imagine. A fundamental surprise is when you win the lottery and you don't buy a ticket. Okay? It is not even within the realm of potential in your mind. Forget about looking for signs that it could happen. Okay, and this is really important: our ability to handle fundamental surprises. They happen all the time. They're just not very dramatic. You all have experience with this. Resilience is proactive activities aimed at preparing to be unprepared in that way. Resilience, and this is the good news, is already happening and it's present in your organization. The difficulty is that it's hard to see. It's hard to characterize. It's hard to identify. But an example in the wild, if I can: we continually invest in an ability to deploy -- not to deploy -- but the ability to deploy if we need to, because it's needed for handling fundamental surprises. We have automated tests and peer code review, feature flags. All of these things that you see cannot be justified economically beforehand.

It is the audacity of doing that, that continues to keep our abilities fresh. There is a valid complaint that resilience engineering, because a lot of it is described in academic journals, is too academic. I say bullshit. You all can do the work. Understanding resilience engineering, just like understanding distributed systems, will take time. The concepts with resilience engineering are not intuitive. They flip your mind around, but they're also critically important. I will say this: this is happening. There are those who are making the effort: there's a conference, ReDeploy, and there are people in our industry who are furthering their own education and degrees, and online dialogues are happening. The next talk that I want you to see on this topic is this one. I'll tweet the link. There's a number of different threads that you all should pull on. Just realize it's going to take some time. It's gonna take some time to sort of grok this. I've put up a couple of URLs for you to read and learn more, especially from the people who are really putting in the time and effort to sort of understand and connect with this community. Thank you.