Dr. Richard Cook and I were honored to contribute to a chapter in the new book Seeking SRE edited by the lovely David Blank-Edelman. Consider it a signpost along the way as we learn more about how engineers actually do their work in real-world scenarios (rather than abstract or generalized descriptions of how we like to imagine they work in idealized ways).
Here is a portion of the introduction to the chapter:
The modern “system” is a constantly changing melange of hardware and software embedded in a variable world. Together, the hyper-distribution, fluctuant composition, constantly varying workload, and continuous modification of modern technology assemblies comprises a unique challenge to those who design, maintain, diagnose, and repair them. We are involved in exploring this challenge and trying to understand how people are able to keep our systems working and, in particular, how they make sense out of what is happening around them. What we find is both inspiring and worrisome. Inspiring because the studies reveal highly refined expertise in people and groups along with novel mechanisms for bringing that expertise to bear. Worrisome because the technology and organization are so often poorly configured to make this expertise effective.
Together with our colleagues, we have studied people doing SRE work, the problems they face, the approaches they take, and the issues that arise in the middle. From a distance, this work is often imagined as narrowly technical, even mundane. Examining the work as done, in contrast, reveals that SRE work is often stormy and sometimes dangerous.
This chapter gives a brief overview of what we think we now know about modern technology and the work of people who design, maintain, diagnose, and repair it. There are similarities between groups doing SRE work and those working in other high-consequence domains. We conclude with some general suggestions about ways to better support the work and more specific suggestions about how to better characterize work in this demanding, conflicted environment.
The original title for the chapter was “In the Center of the Cyclone: SRE Work” which, while a bit more poetic, didn’t seem to fit with the rest of the chapter titles. 🙂
In this chapter we try to describe what “cognitive work” is, what perspectives we take in studying it closely, and, most importantly, why it matters so much. If we hit our goal, we will have given the SRE community new vocabulary and new lenses through which to see their work.
One of the most important messages we wanted to get across is that much of what makes our systems actually work, and what gets them restored when they break, cannot be captured in quantitative terms, only in qualitative ones.
A glimpse of the section headings:
- What Do SRE People Do?
- Why Should We Care About Practitioner Cognition?
- Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted
- Human Performance in Modern Complex Systems: The Main Themes
- Observations on SRE Cognitive Work Around Incidents
- Every Incident Could Have Been Worse
- Sacrifice Decisions Take Place Under Uncertainty
- Repairs to Functional Systems
- Special Knowledge About Complex Systems
- Managing the Costs of Coordination
- SREs Are Cognitive Agents Working in a Joint Cognitive System
- The Calibration Problem
- Mental Models
- Incidents Trigger Individual Recalibration
- Incidents Are Opportunities for Collective Recalibration
- What Are the Implications of All This?
And finally, we propose some routes to make progress in the section “What Can You Do?”
We hope the chapter can shake loose some new thoughts and coax new and better questions out of the SRE community.