Hospitals and GPs couldn't treat patients, thousands of airplanes were grounded, and small businesses couldn't process transactions. The 2024 CrowdStrike outage, which affected over 8.5 million devices and is considered one of the largest IT outages in history, was triggered by a faulty software update. How do we make sure another CrowdStrike-like IT outage does not happen again? The answer is Operational Safety!
In a world where software powers everything from our cars to our healthcare systems, the impact of a single software or configuration change has never been higher. While we've become experts at deploying updates rapidly, we've neglected to advance the Change Safety measures that ensure these changes don't lead to breakages and failures.
Imagine driving a car that's getting faster and more powerful every year, but the brakes and safety features remain outdated—or worse, each driver is responsible for crafting their own safety gear. Sounds risky, doesn't it? This is precisely the state of modern software deployments.
We've turbocharged our ability to deliver code, moving from custom scripts in the '90s to today's automated CI/CD pipelines and GitOps practices. Tools like Jenkins, Kubernetes, and GitHub have enabled us to execute millions of automated deployments daily. The change safety mechanisms that should accompany this speed have not kept pace. While teams often implement their own change safety measures—such as custom scripts, manual reviews, and homegrown testing—these ad-hoc approaches can result in unsafe and risky operations.
This lack of standardized change safety measures is akin to powering an old sedan with a Ferrari engine while neglecting the brakes. The car might be fast, but without reliable brakes, it's a disaster waiting to happen. To navigate this challenge, we need to acknowledge the unchanging realities that define our software landscapes.
There are unchanging realities in software systems that make standardized change safety tooling imperative:
Overlooking these constants isn't just theoretical; it leads to real-world disasters that affect millions.
Take the 2024 CrowdStrike outage as an example – what began as a routine software update quickly spiraled into one of the most devastating IT failures in recent memory. A flawed software update from the security vendor crashed millions of Windows systems worldwide with the infamous Blue Screen of Death (BSOD). This single change grounded thousands of flights, prevented hospitals from providing critical care, and disrupted financial institutions and small businesses alike. Insurers estimate the outage will cost U.S. Fortune 500 companies around $5.4 billion.
The 2022 Atlassian outage provides another stark reminder of the catastrophic consequences that come with inadequate change safety measures. On April 5th, 2022, a script intended to delete a legacy app inadvertently deleted 883 customer sites, affecting 775 customers. Some experienced outages lasting up to 14 days, losing access to critical tools like Jira and Confluence.
Similarly, AT&T’s 2024 outage is yet another example of the damage a poorly managed change can cause. Users across the United States lost internet access and the ability to make phone calls, even to emergency services like 911. The outage affected over 125 million devices, blocking more than 92 million voice calls and preventing over 25,000 attempts to reach 911 call centers. The root cause was an equipment configuration error introduced during network expansion work. This incident highlights how a lack of peer review, inadequate testing, and insufficient safeguards (“change safety”) in implementing network changes can lead to massive service disruptions.
These incidents all underscore a common theme: without consistent change safety practices and standardized change safety tooling, failures are inevitable. The repercussions extend beyond immediate financial losses to long-term reputational damage and erosion of customer trust. Despite these high stakes, many organizations cling to makeshift safety measures, creating an illusion of operational safety.
Many organizations rely on homegrown tools and reactive approaches to ensure the change safety of their software deployments. These might include custom scripts, manual reviews, or "scream tests"—deploying changes and waiting to see if anyone reports a problem. While these methods can catch some issues, they create an illusion of safety that doesn't hold up under the increasing complexity and speed of modern software systems.
The status quo is fraught with risks:
Imagine updating a certificate management tool across your organization without standardized change safety checks. You deploy the change and wait. Days later, critical applications fail because the certificate renewal didn't go as planned. Your monitoring tools didn't flag it, and now you're scrambling to fix an outage that could have been prevented.
The lack of standardization in change safety tooling is a significant obstacle that pushes safety to the back burner. Without a unified framework or industry-wide standards, organizations struggle to implement effective change safety measures, leading to inconsistent practices and increased risks of failures. This fragmentation hampers collaboration, knowledge sharing, and the ability to build upon proven methods, making software systems more vulnerable to errors and outages.
Several factors exacerbate this challenge:
What got us here won't get us there. We've excelled at accelerating software delivery but have neglected the necessary change safety mechanisms to keep pace. The unchanging realities in software systems—mission-critical applications, the relentless demand for speed, increasing complexity, and the severe consequences of failures—demand a new approach.
A call to action: