I took up the challenge of speaking about Resilience Engineering in an Ignite talk at the always excellent NYC DevOpsDays.
Here’s the abstract for this 5-minute talk:
Of course the answer to the question in the title is “no,” because this twenty-year-old multidisciplinary field is as broad and deep as Distributed Systems. Bringing perspectives, methods, and concepts from Resilience Engineering is a long game; my goal is to whet your appetite and lay down enough compelling threads for you to pull on as this important long game unfolds.
Here’s the video, with the interactive transcript:
resilience engineering only in the last couple of years. So, to dispel some notions: resilience is really nothing about computers or software, even though we say that. It's not about redundancy or robustness or high availability or fault tolerance. It's not even about Chaos. It's not Chaos Engineering -- although it's related -- it's activities to sustain adaptation. We have to adapt, and we're constantly adapting whether we notice it or not, in the people, in the teams that are doing this work.
Resilience plays out in the events that are unforeseen, unanticipated, fundamentally surprising. And this is something I really want to drive home for you. What does it mean to be fundamentally surprising? Well, imagine a lottery ticket. Many of us who might buy a lottery ticket are not gonna win. A situational surprise is where you buy a ticket and you win the lottery -- it's within the realm that you could imagine. A fundamental surprise is when you win the lottery and you didn't buy a ticket. Okay? It is not even within the realm of potential in your mind. Forget about looking for signs that it could happen.
Okay, and this is really important: our ability to handle fundamental surprises. They happen all the time. They're just not very dramatic. You all have experience with this. Resilience is proactive activities aimed at preparing to be unprepared in that way.
Resilience, and this is the good news, is already happening, and it's present in your organization. The difficulty is that it's hard to see. It's hard to characterize. It's hard to identify. But here's an example in the wild, if I can: we continually invest in an ability to deploy -- not to deploy, but the ability to deploy if we need to -- because it's needed for handling fundamental surprises. We have automated tests and peer code review and feature flags. All of these things that you see cannot be justified economically beforehand.
It is the audacity of doing that, that continues to keep our abilities fresh. There is a valid complaint that resilience engineering, because a lot of it is described in academic journals, is too academic. I say bullshit. You all can do the work. Understanding resilience engineering, just like understanding distributed systems, will take time. The concepts in resilience engineering are not intuitive. They flip your mind around, but they're also critically important.
I will say this: this is happening. There are those who are making an effort. There's a conference, ReDeploy, there are people in our industry who are furthering their own education with degrees, and online dialogues are happening. The next talk that I want you to see on this topic is this one. I'll tweet the link. There's a number of different threads that you all should pull on. Just realize it's going to take some time. It's gonna take some time to sort of grok this. I've put up a couple of URLs for you to read and learn more, especially from the people who are really putting in the time and effort to understand and connect with this community. Thank you.