A few weeks ago I tweeted this thread, which references sacrifice decisions and contrasts some facets of the Knight Capital (2012) case and the NYSE trading halt (2015) case: On Aug 1, 2012, a company named Knight Capital experienced a business-destroying incident. Much has been written about it, but that’s not the topic of this thread.
In 2017 I gave a talk at the DevOps Enterprise Summit and described a sort of frame that my colleagues outlined in the Stella Report. I walked through it bit by bit, and I’ve captured just that 6-minute excerpt in this video here. However, not everyone likes to watch videos of conference talks, so below
Dr. Richard Cook and I were honored to contribute to a chapter in the new book Seeking SRE edited by the lovely David Blank-Edelman. Consider it a signpost along the way as we learn more about how engineers actually do their work in real-world scenarios (rather than abstract or generalized descriptions of how we like to
In August I was honored to speak at the inaugural REdeploy conference centered on the topic of resilience. Here is the abstract for the talk: Abstract Sustaining the potential to adapt to unforeseen situations (resilience) is a necessary element in complex systems. One could say that all successful endeavors require this. But resilience is (in many ways)
In September I gave a keynote talk at the PagerDuty Summit in San Francisco, and I echoed some of what I covered in my talk at the REdeploy conference. For those who like oversimplifications, here is a bulleted list of “takeaways”: Real incidents (and the response to them) are messy, and do not take the nice-and-neat form
The conventional rationale for undertaking some form of post-incident review (regardless of what you call this process) is to “learn from failure.” Offered without further specifics or context, this is, for the most part, a banal platitude aimed at providing at least a bit of comfort that someone is doing something in the wake of these surprising