Earlier this year, a single faulty software update from CrowdStrike led to one of the largest IT outages in history, affecting an estimated 8.5 million Windows devices. Hospitals couldn't treat patients, thousands of flights were grounded, and small businesses couldn't process transactions. This event forces us to confront a critical question: How can we prevent such catastrophic failures in the future? I believe the answer lies in Operational Safety.
Operational Safety refers to the comprehensive set of standardized mechanisms, tools, and safeguards designed to ensure that all changes to software systems are implemented safely without causing breakages, degradation, or failures. It ensures that existing services run without disruption as systems evolve.
While we've become experts at deploying updates rapidly and at scale, we've neglected to advance the Operational Safety measures that ensure these changes don't lead to breakages and failures. To truly grasp the magnitude of the problem, let's examine how inadequate Operational Safety has led to real-world disasters.
The 2024 CrowdStrike outage is a stark illustration of how inadequate Operational Safety can lead to catastrophic outcomes. Insurers estimate the outage will cost U.S. Fortune 500 companies around $5.4 billion. But this is not an isolated incident. Other organizations have faced similar crises due to insufficient safety measures.
The 2022 Atlassian outage is another example. On April 5, 2022, a script intended to delete a legacy app inadvertently deleted 883 customer sites, affecting 775 customers. Some of them were locked out of critical tools like Jira and Confluence for up to 14 days.
Similarly, AT&T’s 2024 outage shows how widely a poorly managed change can spread. Users across the United States experienced significant service interruptions, including loss of internet access and the inability to make phone calls, even to emergency services like 911. The outage affected over 125 million devices, blocked more than 92 million voice calls, and prevented over 25,000 attempts to reach 911 call centers. The root cause was an equipment configuration error, introduced by a network change that was made during network expansion work without following the correct process. This incident shows how a lack of peer review, inadequate testing, and insufficient safeguards in implementing network changes can lead to massive service disruptions.
These incidents share a common theme: without consistent Operational Safety practices and standardized safety tooling, failures are inevitable. The repercussions extend beyond immediate financial losses to long-term reputational damage and erosion of customer trust. They also raise a critical question: are organizations forced to choose between innovation and safety?
Imagine driving a car that's getting faster and more powerful every year, but the brakes and safety features remain outdated—or worse, each driver is responsible for crafting their own safety gear. Sounds risky, doesn't it? This is precisely the state of modern software deployments.
We've turbocharged our ability to deliver code, moving from custom scripts in the '90s to today's automated CI/CD pipelines and GitOps practices. Tools like Jenkins, Kubernetes, and GitHub have enabled us to execute millions of automated deployments daily. Yet the Operational Safety mechanisms that should accompany this speed have not kept pace. Teams often implement their own Operational Safety measures, such as custom scripts, manual reviews, and homegrown testing, but these ad-hoc approaches vary from team to team and can still result in unsafe and risky operations.
This lack of standardized safety measures is akin to powering an old sedan with a Ferrari engine while neglecting the brakes. The car might be fast, but without reliable brakes, it's a disaster waiting to happen. Despite understanding that speed and safety can coexist, many organizations still rely on flawed approaches that give a false sense of security.
Many organizations rely on homegrown tools and reactive approaches to ensure the Operational Safety of their software deployments. These might include custom scripts, manual reviews, or "scream tests"—deploying changes and waiting to see if anyone reports a problem. While these methods can catch some issues, they create an illusion of safety that doesn't hold up under the increasing complexity and speed of modern software systems.
The status quo is fraught with risks:
Imagine updating a certificate management tool across your organization without standardized Operational Safety checks. You deploy the change and wait. Days later, critical applications fail because the certificate renewal didn't go as planned. Your monitoring tools didn't flag it, and now you're scrambling to fix an outage that could have been prevented.
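To make the certificate scenario concrete, here is a minimal sketch of the kind of proactive check that could replace "deploy and wait." It assumes you already have an inventory mapping each service to its certificate's `notAfter` string, in the format Python's `ssl.SSLSocket.getpeercert()` returns. The function names, the inventory, and the 30-day threshold are all hypothetical, chosen only for illustration.

```python
import ssl
import time

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after, now=None):
    """Return the number of days left before a certificate's notAfter time.

    `not_after` uses the "Mon DD HH:MM:SS YYYY GMT" format that
    ssl.SSLSocket.getpeercert() reports. `now` is a Unix timestamp and
    defaults to the current time; passing it explicitly keeps tests stable.
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def certs_needing_attention(inventory, min_days=30, now=None):
    """Return the sorted names of certificates expiring within `min_days`.

    `inventory` is a hypothetical {service_name: not_after_string} mapping.
    """
    return sorted(
        name
        for name, not_after in inventory.items()
        if days_until_expiry(not_after, now) < min_days
    )


if __name__ == "__main__":
    # Hypothetical inventory; in practice this would come from your own tooling.
    inventory = {
        "billing-api": "Jan 11 00:00:00 2030 GMT",
        "login-service": "Mar  2 00:00:00 2030 GMT",
    }
    for name in certs_needing_attention(inventory):
        print(f"{name}: certificate expires in under 30 days")
```

Run after every change to the certificate tooling, and on a schedule, a check like this surfaces a failed renewal before the certificate lapses rather than days after applications start failing.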
If these methods are inadequate, why do they persist? The answer lies in several underlying challenges.
The lack of standardization in Operational Safety tooling is a significant obstacle that pushes safety to the back burner. Without a unified framework or industry-wide standards, organizations struggle to implement effective Operational Safety measures, leading to inconsistent practices and increased risks of failures. This fragmentation hampers collaboration, knowledge sharing, and the ability to build upon proven methods, making software systems more vulnerable to errors and outages.
Several factors exacerbate this challenge:
While these obstacles are significant, certain realities in software systems remain constant, making the need for Operational Safety unavoidable.
There are unchanging realities in software systems that make standardized Operational Safety tooling imperative:
Overlooking these constants isn't just theoretical; it leads to real-world breakages that affect mission-critical applications.
We've excelled at accelerating software delivery, but our safety mechanisms have not kept pace. The unchanging realities in software systems demand a new approach. It's time to rethink our strategies and make Operational Safety an integral part of our processes.
A Call to Action