Change Safety
November 26, 2024

CrowdStrike outage was the symptom; missing Operational Safety was the cause

Written by
Fawad Khaliq

Earlier this year, a single faulty software update from CrowdStrike led to one of the largest IT outages in history, affecting roughly 8.5 million Windows devices. Hospitals couldn't treat patients, thousands of flights were grounded, and small businesses couldn't process transactions. This event forces us to confront a critical question: How can we prevent such catastrophic failures in the future? I believe the answer lies in Operational Safety.

Operational Safety refers to the comprehensive set of standardized mechanisms, tools, and safeguards designed to ensure that all changes to software systems are implemented safely without causing breakages, degradation, or failures. It ensures that existing services run without disruption as systems evolve.

While we've become experts at deploying updates rapidly and at scale, we've neglected to advance the Operational Safety measures that ensure these changes don't lead to breakages and failures. To truly grasp the magnitude of the problem, let's examine how inadequate Operational Safety has led to real-world disasters.

Failures have real-world consequences

The 2024 CrowdStrike outage is a stark illustration of how inadequate Operational Safety can lead to catastrophic outcomes. Insurers estimate the outage will cost U.S. Fortune 500 companies around $5.4 billion. But this is not an isolated incident. Other organizations have faced similar crises due to insufficient safety measures.

The 2022 Atlassian outage is another reminder of what inadequate Operational Safety measures can cost. On April 5, 2022, a script intended to delete a legacy app inadvertently deleted 883 customer sites, affecting 775 customers. Some of them experienced outages lasting up to 14 days, losing access to critical tools like Jira and Confluence.
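To make this concrete, here is a minimal Python sketch of the kind of pre-flight guard a destructive script could run before it deletes anything. It is an illustration under stated assumptions, not a description of Atlassian's tooling: the names `DeletionPlan`, `check_deletion_plan`, and `UnsafeChangeError`, the inventory lookup, and the default cap of 10 targets are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


class UnsafeChangeError(Exception):
    """Raised when a destructive change fails a pre-flight check."""


@dataclass(frozen=True)
class DeletionPlan:
    """Hypothetical description of what a cleanup script is about to delete."""

    target_kind: str        # the kind of object the script is meant to delete, e.g. "app"
    ids: tuple[str, ...]    # the identifiers the script was actually handed


def check_deletion_plan(
    plan: DeletionPlan,
    known_ids_by_kind: dict[str, set[str]],
    max_targets: int = 10,
) -> list[str]:
    """Return the IDs that are safe to delete, or raise before anything is touched.

    Two assumed safeguards:
    1. every ID must belong to the kind of object the script is meant to delete;
    2. the number of targets must stay under a blast-radius cap.
    """
    allowed = known_ids_by_kind.get(plan.target_kind, set())
    wrong_kind = [i for i in plan.ids if i not in allowed]
    if wrong_kind:
        raise UnsafeChangeError(
            f"{len(wrong_kind)} id(s) are not of kind {plan.target_kind!r}, "
            f"for example {wrong_kind[0]!r}"
        )
    if len(plan.ids) > max_targets:
        raise UnsafeChangeError(
            f"{len(plan.ids)} targets exceeds the cap of {max_targets}; "
            "split the change or get explicit approval"
        )
    return list(plan.ids)
```

A script that calls `check_deletion_plan` first fails loudly when it is handed site IDs while expecting app IDs, or when the batch is larger than anyone intended, instead of finding out after the data is gone.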

Similarly, AT&T’s 2024 outage underscores the widespread disruptions caused by poorly managed changes. Users across the United States experienced significant service interruptions, including loss of internet access and the inability to make phone calls—even to emergency services like 911. The outage affected over 125 million devices, blocked more than 92 million voice calls, and prevented over 25,000 attempts to reach 911 call centers. The root cause was an incorrect process applied during network expansion efforts—a network change involving an equipment configuration error. This incident emphasizes how a lack of peer review, inadequate testing, and insufficient safeguards in implementing network changes can lead to massive service disruptions.

These incidents underscore a common theme: without consistent Operational Safety practices and standardized safety tooling, failures are inevitable. The repercussions extend beyond immediate financial losses to long-term reputational damage and erosion of customer trust. These events raise a critical question: are organizations forced to choose between innovation and safety?

The false choice between speed and safety

Imagine driving a car that's getting faster and more powerful every year, but the brakes and safety features remain outdated—or worse, each driver is responsible for crafting their own safety gear. Sounds risky, doesn't it? This is precisely the state of modern software deployments.

We've turbocharged our ability to deliver code, moving from custom scripts in the '90s to today's automated CI/CD pipelines and GitOps practices. Tools like Jenkins, Kubernetes, and GitHub have enabled us to execute millions of automated deployments daily. Yet the Operational Safety mechanisms that should accompany this speed have not kept pace. Teams often implement their own Operational Safety measures, such as custom scripts, manual reviews, and homegrown testing, but these ad-hoc approaches vary from team to team and can still result in unsafe and risky operations.

This lack of standardized safety measures is akin to powering an old sedan with a Ferrari engine while neglecting the brakes. The car might be fast, but without reliable brakes, it's a disaster waiting to happen. Despite understanding that speed and safety can coexist, many organizations still rely on flawed approaches that give a false sense of security.

The illusion of safety with homegrown tools and reactive approaches

Many organizations rely on homegrown tools and reactive approaches to ensure the Operational Safety of their software deployments. These might include custom scripts, manual reviews, or "scream tests"—deploying changes and waiting to see if anyone reports a problem. While these methods can catch some issues, they create an illusion of safety that doesn't hold up under the increasing complexity and speed of modern software systems.

The status quo is fraught with risks:

  • Reactive nature: Problems are only addressed after they've affected users or systems, leading to downtime and potential data loss.
  • Incomplete coverage: Homegrown tools often lack the comprehensive Operational Safety checks needed to catch all failure modes, especially in complex, interdependent systems.
  • Inconsistency: Without standardized Operational Safety measures, different teams may implement varying levels of protection, leaving gaps through which unforeseen errors reach production.

Imagine updating a certificate management tool across your organization without standardized Operational Safety checks. You deploy the change and wait. Days later, critical applications fail because the certificate renewal didn't go as planned. Your monitoring tools didn't flag it, and now you're scrambling to fix an outage that could have been prevented.
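A proactive version of that scenario could be as simple as checking certificate expiry dates right after the change, rather than waiting for something to break. The sketch below is a minimal Python illustration, assuming you can already collect each service's certificate expiry time from your own inventory. The function name `expiring_certificates`, the service names, and the 30-day threshold are hypothetical choices, not a prescribed standard.

```python
from datetime import datetime, timedelta, timezone


def expiring_certificates(
    not_after_by_name: dict[str, datetime],
    now: datetime,
    min_remaining: timedelta = timedelta(days=30),
) -> list[str]:
    """Return the names whose certificate is expired or expires within
    `min_remaining` of `now`, soonest expiry first.

    All datetimes are assumed to be timezone-aware.
    """
    deadline = now + min_remaining
    at_risk = [
        (expiry, name)
        for name, expiry in not_after_by_name.items()
        if expiry <= deadline
    ]
    return [name for _expiry, name in sorted(at_risk)]


# Example run with made-up services and dates.
now = datetime(2024, 11, 26, tzinfo=timezone.utc)
inventory = {
    "payments": now + timedelta(days=5),   # renewal apparently did not happen
    "search": now + timedelta(days=90),    # renewed as expected
    "auth": now - timedelta(days=1),       # already expired
}
at_risk = expiring_certificates(inventory, now)
print(at_risk)
```

Run as a gate after the rollout, a non-empty result would stop the change from being declared done, which turns a days-later outage into an immediate, actionable signal.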

If these methods are inadequate, why do they persist? The answer lies in several underlying challenges.

Why hasn't this been solved already?

The lack of standardization in Operational Safety tooling is a significant obstacle that pushes safety to the back burner. Without a unified framework or industry-wide standards, organizations struggle to implement effective Operational Safety measures, leading to inconsistent practices and increased risks of failures. This fragmentation hampers collaboration, knowledge sharing, and the ability to build upon proven methods, making software systems more vulnerable to errors and outages.

Several factors exacerbate this challenge:

  • Economic constraints and short-term priorities: Teams often focus on delivering features rapidly to stay competitive. Allocating resources to develop standardized Operational Safety measures can seem impractical when the return on investment isn't immediately apparent. The emphasis on immediate results makes it difficult to justify the upfront costs of building and maintaining Operational Safety tooling.
  • Rapidly changing software landscape: Software designs, versions, and configurations evolve at a rapid pace—sometimes within days or weeks. New tools and dependencies require continuous updates to safety checks. Investing heavily in safety measures for components that might soon change seems inefficient, causing teams to deprioritize this critical aspect.
  • Cultural resistance and knowledge gaps: Organizations may resist overhauling established processes, especially if current methods appear to be working "well enough." A lack of awareness about the benefits of standardized Operational Safety tooling or uncertainty about how to implement it effectively can result in Operational Safety being treated as an afterthought rather than a fundamental component of the change process.

While these obstacles are significant, certain realities in software systems remain constant, making the need for Operational Safety unavoidable.

What won't change in software systems

There are unchanging realities in software systems that make standardized Operational Safety tooling imperative:

  • Software will continue to power critical services and infrastructure. Whether it's healthcare systems, financial transactions, or transportation networks, software is at the heart of it all.
  • Markets will always favor those who can innovate quickly. The demand for rapid delivery of new features and updates isn't slowing down. 
  • Complexity will only keep increasing. The shift to microservices has fragmented applications into countless interconnected components. Dependencies on open-source libraries and SaaS solutions, along with rapidly evolving tech stacks driven by emerging technologies like LLMs, add layers of intricacy. This escalating complexity significantly heightens the risk of unforeseen interactions and failures.

Overlooking these constants isn't just theoretical; it leads to real-world breakages that affect mission-critical applications. 

What got us here won't get us there

We've excelled at accelerating software delivery but have neglected the necessary safety mechanisms to keep pace. The unchanging realities in software systems demand a new approach. It's time to rethink our strategies and prioritize Operational Safety as an integral part of our processes.

A call to action

  • For Engineering Leaders: Invest in Operational Safety tooling and allocate resources to integrate these tools into your deployment pipelines. Foster a culture where Operational Safety is prioritized alongside innovation to prevent costly outages and build trust with customers.
  • For Engineers: Advocate for Operational Safety by incorporating safety tooling early in the development process. Continuously educate yourself and your team on effective Operational Safety practices to minimize the risk of breakages and failures.
  • For the Industry: Collaborate to establish unified safety standards, share best practices, and contribute to Operational Safety initiatives.
  • For Policymakers: Develop and mandate Operational Safety controls, especially for mission-critical systems. Begin by defining what constitutes mission-critical systems and establish Operational Safety protocols. Implement policies that require adherence to these protocols to prevent widespread service disruptions and protect public interests.