A case for Operational Safety in software operations

Estimated Reading time

5 min

Hospitals and GPs couldn't treat patients, thousands of airplanes were grounded, and small businesses couldn't process transactions. The 2024 CrowdStrike outage, which affected over 8.5 million devices and is considered one of the largest IT outages in history, was triggered by a faulty software update. How do we make sure another CrowdStrike-like IT outage does not happen again? The answer is Operational Safety!

In a world where software powers everything from our cars to our healthcare systems, the impact of a single change to software or configuration has never been higher. While we've become experts at deploying updates rapidly, we've neglected to advance the change safety measures that ensure these changes don't lead to breakages and failures.

The false choice between speed and safety

Imagine driving a car that's getting faster and more powerful every year, but the brakes and safety features remain outdated—or worse, each driver is responsible for crafting their own safety gear. Sounds risky, doesn't it? This is precisely the state of modern software deployments.

We've turbocharged our ability to deliver code, moving from custom scripts in the '90s to today's automated CI/CD pipelines and GitOps practices. Tools like Jenkins, Kubernetes, and GitHub have enabled us to execute millions of automated deployments daily. Yet the change safety mechanisms that should accompany this speed have not kept pace. Teams often implement their own change safety measures, such as custom scripts, manual reviews, and homegrown testing, but these ad-hoc approaches vary from team to team and can still leave operations unsafe and risky.

This lack of standardized change safety measures is akin to powering an old sedan with a Ferrari engine while neglecting the brakes. The car might be fast, but without reliable brakes, it's a disaster waiting to happen. To navigate this challenge, we need to acknowledge the unchanging realities that define our software landscapes.

What won't change in software systems

There are unchanging realities in software systems that make standardized change safety tooling imperative:

Software will continue to power essential services and infrastructure. Whether it's healthcare systems, financial transactions, or transportation networks, software is at the heart of it all.
Markets will always favor those who can innovate quickly. The demand for rapid delivery of new features and updates isn't slowing down.
Complexity will continue increasing monotonically. The shift to microservices has fragmented applications into countless interconnected components. Dependencies on open-source libraries and SaaS solutions, along with rapidly evolving tech stacks driven by emerging technologies like LLMs, add layers of intricacy. This escalating complexity significantly heightens the risk of unforeseen interactions and failures.

Overlooking these constants isn't just theoretical; it leads to real-world disasters that affect millions.

Failures have real-world consequences

Take the 2024 CrowdStrike outage as an example: what began as a routine software update quickly spiraled into one of the most devastating IT failures in recent memory. A flawed software update from the security vendor led to millions of Windows systems worldwide crashing with the infamous Blue Screen of Death (BSOD). This single change grounded thousands of flights, left hospitals unable to provide critical care, and disrupted financial institutions and small businesses. Insurers estimate the outage will cost U.S. Fortune 500 companies around $5.4 billion.

The 2022 Atlassian outage provides another stark reminder of the catastrophic consequences that come with inadequate change safety measures. On April 5th, 2022, a script intended to delete a legacy app inadvertently deleted 883 customer sites, affecting 775 customers. Some experienced outages lasting up to 14 days, losing access to critical tools like Jira and Confluence.
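One common safeguard for bulk operations like this is a blast-radius guard: the script refuses to act on more targets than an agreed limit and defaults to a dry run. The sketch below is illustrative only and does not describe Atlassian's tooling; `delete_sites`, `delete_fn`, `max_targets`, and `BlastRadiusExceeded` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


class BlastRadiusExceeded(Exception):
    """Raised when a bulk change would touch more targets than allowed."""


@dataclass
class DeletionPlan:
    targets: list[str]   # de-duplicated, sorted IDs the call covers
    executed: bool       # False for a dry run, True if deletions ran


def delete_sites(site_ids, delete_fn, max_targets=10, dry_run=True):
    """Delete sites only if the batch stays within the allowed blast radius.

    delete_fn is a hypothetical callable that deletes one site by ID.
    The limit is checked before anything is deleted, and the default
    is a dry run, so a destructive run has to be requested explicitly.
    """
    targets = sorted(set(site_ids))
    if len(targets) > max_targets:
        raise BlastRadiusExceeded(
            f"refusing to delete {len(targets)} sites (limit {max_targets})"
        )
    if dry_run:
        return DeletionPlan(targets=targets, executed=False)
    for site_id in targets:
        delete_fn(site_id)
    return DeletionPlan(targets=targets, executed=True)
```

With a guard like this, an operator first reviews the plan returned by the dry run, then repeats the call with `dry_run=False`. A batch that is unexpectedly large fails before the first deletion instead of after the last one.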

Similarly, AT&T’s 2024 outage is yet another example of the damage caused by poorly managed changes. Users across the United States experienced widespread service disruptions, including loss of internet access and the inability to make phone calls, even to emergency services like 911. The outage affected over 125 million devices, blocking more than 92 million voice calls and preventing over 25,000 attempts to reach 911 call centers. The root cause was an incorrect process applied during network expansion efforts: a network change involving an equipment configuration error. This incident highlights how a lack of peer review, inadequate testing, and insufficient safeguards (“change safety”) in implementing network changes can lead to massive service disruptions.

These incidents all underscore a common theme: without consistent change safety practices and standardized change safety tooling, failures are inevitable. The repercussions extend beyond immediate financial losses to long-term reputational damage and erosion of customer trust. Despite these high stakes, many organizations cling to makeshift safety measures, creating an illusion of operational safety.

The illusion of safety with homegrown tools and reactive approaches

Many organizations rely on homegrown tools and reactive approaches to ensure the change safety of their software deployments. These might include custom scripts, manual reviews, or "scream tests"—deploying changes and waiting to see if anyone reports a problem. While these methods can catch some issues, they create an illusion of safety that doesn't hold up under the increasing complexity and speed of modern software systems.

The status quo is fraught with risks:

Reactive nature: Problems are only addressed after they've affected users or systems, leading to downtime and potential data loss.
Incomplete coverage: Homegrown tools often lack the comprehensive checks needed to catch all failure modes, especially in complex, interdependent systems.
Inconsistency: Without standardized change safety measures, different teams may implement varying levels of protection, resulting in gaps that can be exploited by unforeseen errors.
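To make the contrast with a "scream test" concrete, here is a minimal sketch of a shared pre-change check runner that every team could call before applying a change. It is an assumption-laden illustration, not a reference to any existing tool: the `change` dictionary, its keys (`reviewers`, `rollback_plan`), and all function names are hypothetical.

```python
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    name: str
    passed: bool
    detail: str


# A check inspects a description of the proposed change
# and returns (passed, detail).
Check = Callable[[dict], tuple[bool, str]]


def has_peer_review(change: dict) -> tuple[bool, str]:
    reviewers = change.get("reviewers", [])
    return (len(reviewers) >= 1, f"{len(reviewers)} reviewer(s)")


def has_rollback_plan(change: dict) -> tuple[bool, str]:
    plan = change.get("rollback_plan", "")
    return (bool(plan), "rollback plan present" if plan else "no rollback plan")


def run_checks(change: dict, checks: dict[str, Check]) -> list[CheckResult]:
    """Run every check; a check that raises is recorded as a failure."""
    results = []
    for name, check in checks.items():
        try:
            passed, detail = check(change)
        except Exception as exc:  # fail closed: a crashing check blocks the change
            passed, detail = False, f"check raised {exc!r}"
        results.append(CheckResult(name, passed, detail))
    return results


def is_safe_to_apply(results: list[CheckResult]) -> bool:
    """A change is allowed only if at least one check ran and all passed."""
    return bool(results) and all(r.passed for r in results)
```

The design choice worth noting is that the set of checks lives in one shared place, so every team gets the same baseline of protection, and that a missing or crashing check blocks the change rather than silently letting it through.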

Imagine updating a certificate management tool across your organization without standardized change safety checks. You deploy the change and wait. Days later, critical applications fail because the certificate renewal didn't go as planned. Your monitoring tools didn't flag it, and now you're scrambling to fix an outage that could have been prevented.
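A proactive check for this scenario could look roughly like the sketch below: record certificate expiry dates before the change, then confirm afterwards that every certificate actually moved forward. This is a hypothetical example; how expiry dates are collected (the `expiry_by_name` mapping) and the 14-day threshold are assumptions, not part of any specific tool.

```python
from datetime import datetime, timedelta, timezone


def certificates_at_risk(expiry_by_name, now, min_remaining=timedelta(days=14)):
    """Return sorted names of certificates expiring within min_remaining of now."""
    return sorted(
        name
        for name, expires_at in expiry_by_name.items()
        if expires_at - now < min_remaining
    )


def renewals_that_did_not_happen(expiry_before, expiry_after):
    """Return sorted names whose expiry did not move forward after the change.

    A certificate missing from expiry_after is treated as not renewed.
    """
    return sorted(
        name
        for name, old_expiry in expiry_before.items()
        if expiry_after.get(name, old_expiry) <= old_expiry
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    before = {"api": now + timedelta(days=5), "web": now + timedelta(days=60)}
    after = {"api": now + timedelta(days=5), "web": now + timedelta(days=90)}
    print("at risk before change:", certificates_at_risk(before, now))
    print("not renewed by change:", renewals_that_did_not_happen(before, after))
```

Run as part of the rollout, a check like this reports the failed renewal right after the change is applied, rather than days later when applications start failing.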

Why hasn't this been solved already?

The lack of standardization in change safety tooling is a significant obstacle that pushes safety to the back burner. Without a unified framework or industry-wide standards, organizations struggle to implement effective change safety measures, leading to inconsistent practices and increased risks of failures. This fragmentation hampers collaboration, knowledge sharing, and the ability to build upon proven methods, making software systems more vulnerable to errors and outages.

Several factors exacerbate this challenge:

Economic constraints and short-term priorities: Teams often focus on delivering features rapidly to stay competitive. Allocating resources to develop standardized change safety measures can seem impractical when the return on investment isn't immediately apparent. The emphasis on immediate results makes it difficult to justify the upfront costs of building and maintaining change safety tooling.

Rapidly changing software landscape: Software designs, versions, and configurations evolve at a rapid pace—sometimes within days or weeks. New tools and dependencies require continuous updates to safety checks. Investing heavily in safety measures for components that might soon change seems inefficient, causing teams to deprioritize this critical aspect.

Cultural resistance and knowledge gaps: Organizations may resist overhauling established processes, especially if current methods appear to be working "well enough." A lack of awareness about the benefits of standardized change safety tooling or uncertainty about how to implement it effectively can result in change safety being treated as an afterthought rather than a fundamental component of the change process.

Looking ahead

What got us here won't get us there. We've excelled at accelerating software delivery but have neglected the necessary change safety mechanisms to keep pace. The unchanging realities in software systems—mission-critical applications, the relentless demand for speed, increasing complexity, and the severe consequences of failures—demand a new approach.

A call to action:

For Engineering Leaders: Invest in change safety tooling and allocate resources to integrate these tools into your deployment pipelines. Foster a culture where safety is prioritized alongside innovation to prevent costly outages and build trust with customers.

For Engineers: Advocate for change safety by incorporating change safety tooling early in the development process. Continuously educate yourself and your team on using that tooling effectively to minimize the risk of breakages and failures.

For the Industry: Collaborate to establish unified safety standards, share best practices, and contribute to change safety initiatives.

For Policymakers: Mandate change safety controls, especially for mission-critical systems. Implement policies that require adherence to safety protocols to protect public interests and prevent widespread service disruptions.
