The Criticality of Sustaining the Deployment Pipeline During COVID-19
Many thanks to the members of the Learning From Incidents community who helped review and contribute to drafts of this post!
The world is in crisis but I feel like the approach that tech companies around the world are taking to immediately add significant bureaucracy to their change management processes is going to go down in history as a small footnote of bad decisions made in this crisis
— Camille Fournier (@skamille) March 19, 2020
We’ve made some observations over the past month about how tech organizations are adjusting their practices as a result of the COVID-19 pandemic. It’s still too early to synthesize these observations into what we might call “patterns,” but some of them do appear to be taking shape:
- COVID-19 reactions are echoing across work around the world. Closing “non-essential” businesses forces many who are used to working alongside their colleagues to work from home (WFH). The ability to make this shift has much to do with progress in IT. The present crisis has accelerated the already established trend. Even so, forcing WFH is producing new challenges, pressures, and uncertainties.
- In parallel, new patterns of internet use are occurring in response to abrupt business shifts, social dislocations, and supply distress. Both the volume and the types of use are changing. This puts strain on the technology infrastructure and the organizational support structure. At the same time, the importance of these structures is sharply increased.
- The situation is rife with uncertainty. Leaders are trying to cope with the resulting complexity. The coping strategies are revealing of those leaders’ mental models of their systems and the looming challenges. Some are seeking safety through tighter, more elaborate change control. The extreme version of this is the “code freeze”.
- Attempts to “freeze” advanced technical systems are, first and foremost, a reaction to uncertainty. Even if current ops are maintained by the “usual suspects” there is concern that these people may be distracted or themselves become ill and unavailable. Interactions are now heavily dependent on frail consumer-grade communications networks. New coordination costs are likely.
- Adopting more elaborate change “approval” processes soothes the bureaucratic mind by producing the illusion of control. These processes reflect a mistaken belief that performance is stabilized by throttling change. This belief is not only incorrect, it is a dangerous fantasy. Pursuing this course raises the stakes of necessary changes and diminishes the capacity to make them.
- To keep systems working while they are buffeted by external and internal disruptions requires adaptation. The capacity to adapt is more important than ever. The avenue for adaptation of modern IT is the deployment process. For the health of these systems, sustaining the deployment pipeline and processes is essential.
The recommendations we’ve heard from industry leaders about where tech organizations should focus their attention appear to align with the general perspective on continuous integration, deployment, and delivery.
These recommendations include:
- Protect the capacity to deploy new code to production. This capacity is as important as the service itself!
- Confirm which engineers are knowledgeable/responsible for keeping the deployment mechanisms working, and what their availability is if things break.
- Identify these experts and encourage others to lean on them when there are questions or concerns.
- Find “cheap” things to deploy at least a couple of times per day. Things like code comments or minor changes to HTML content are candidates.
- Hold off on experimenting with new mechanisms to deploy.
- Encourage more hedging tactics. Help engineers gain confidence in their changes. Provide opportunities to test approaches. Support project work that employs these methods. A few hedging approaches are:
  - Feature flags, staff-only or team-only flags
  - “Dark” launches
  - Gradual rollouts (or “percentage ramp-ups”)
  - Isolated rollouts (segregating rollouts using network/environment barriers, e.g. deploying one-AWS-region-at-a-time)
- Ask everyone to actively seek and expose hidden dependencies. Established ‘safe’ work patterns may become more hazardous. Changes in internal or external environments can expose new paths to failure. Anticipating and coordinating work relies on understanding the dependencies, upstream and downstream, both new and old. Mapping the dependency network and evaluating its implications will take more time and involve more people than it has in the past.
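To make two of the hedging approaches listed above more concrete, here is a minimal sketch of a staff-only flag combined with a percentage ramp-up. It is an illustration under stated assumptions, not a description of any particular flagging product: the function names (`rollout_bucket`, `is_enabled`), the flag name, and the user IDs are all hypothetical. The idea is to hash each user into a stable bucket from 0 to 99, so that raising the percentage only ever adds users and never flips someone back off.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99.

    Hashing the flag name together with the user ID means the same user
    lands in different buckets for different flags (hypothetical scheme).
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, percentage: int,
               staff_ids: frozenset = frozenset()) -> bool:
    """Decide whether a user sees the flagged change.

    Staff always see it (staff-only flag); everyone else sees it only if
    their bucket falls below the current ramp-up percentage.
    """
    if user_id in staff_ids:
        return True
    return rollout_bucket(flag_name, user_id) < percentage


if __name__ == "__main__":
    # Hypothetical data: one staff member and 1000 ordinary users.
    staff = frozenset({"alice"})
    users = [f"user-{i}" for i in range(1000)]
    for pct in (0, 5, 25, 100):
        enabled = sum(is_enabled("new-checkout", u, pct, staff) for u in users)
        print(f"{pct:>3}% ramp: {enabled} of {len(users)} users enabled")
```

With the percentage set to 0, only staff see the change, which gives engineers a way to exercise new code in production before anyone else does. Ramping from there is a configuration change rather than a deploy, so backing out is equally cheap.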
Ask all team leads to examine the projects in their plan and sort the next 2-3 months of work into buckets:
- Work that cannot be postponed. This may include work involving applications or infrastructure that has “hard” limits with respect to dates, such as licenses, “end of life” products or services, or physical location constraints such as datacenter contract obligations (to move in or out), etc.
- Work that other teams depend on for their plans. Intertwined project plans — especially infrastructural and architectural foundations — are common.
- Work that can be postponed, and what the implications might be for pausing that work.
The goal of this exercise is not to comprehensively identify all potential dependencies or issues that might come as a result. It is to be deliberate and explicit about what dependencies can be identified and explored, with teams doing it together.
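Once teams have written down even a partial list of which projects depend on which, the question "what else is affected if we postpone this?" can be answered mechanically. The sketch below is one rough way to do that, assuming the dependencies are recorded as a simple mapping from each project to the projects it depends on. The function name `downstream_of` and every project name are made up for illustration.

```python
from collections import deque


def downstream_of(project: str, depends_on: dict) -> set:
    """Return every project that directly or indirectly depends on `project`.

    `depends_on` maps each project name to the set of projects it needs.
    """
    # Invert the map: for each project, who depends on it directly?
    dependents = {}
    for name, upstreams in depends_on.items():
        for upstream in upstreams:
            dependents.setdefault(upstream, set()).add(name)

    # Breadth-first walk over the inverted map to collect indirect dependents.
    seen = set()
    queue = deque([project])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)

    # If the recorded data contains a cycle, do not report the project itself.
    seen.discard(project)
    return seen


if __name__ == "__main__":
    # Hypothetical plan: each project lists what it depends on.
    plan = {
        "checkout-redesign": {"payments-api-v2"},
        "payments-api-v2": {"db-migration"},
        "db-migration": set(),
        "docs-refresh": set(),
    }
    print(sorted(downstream_of("db-migration", plan)))
```

The output is only as good as the dependencies people have thought to record, which is why the conversation between teams matters more than the tooling.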