April 5, 2023

What Makes Kubernetes Upgrades So Challenging?

Written by
Fawad Khaliq

Our last 100+ conversations with DevOps/SREs can be summarized in four nouns and three emotions: “Kubernetes Cluster Version Upgrades” … “Hard, Pain, Work”. Why are Kubernetes upgrades so challenging? Why isn’t a Kubernetes upgrade as easy as an iPhone upgrade? Here’s what makes it hard and why DevOps/SREs find change management stressful:

  1. Kubernetes isn’t, and shouldn’t be, vertically integrated
  2. You don’t know what’ll break before it breaks
  3. Application performance impact is hard to quantify
  4. Stateful sets are pets… in a universe of cattle
  5. Rollbacks are a pain
  6. Components hit end-of-support/end-of-life frequently
  7. Getting an upgrade right takes a lot of time
  8. There is no way to learn from and avoid each other’s mistakes
  9. Communication between Platform & App teams isn’t streamlined
  10. Safe change management is hard. It’s 10x harder for infrastructure

Let’s dive into each.

Kubernetes isn’t, and shouldn’t be, vertically integrated

K8s is designed for flexibility and cloud providers work hard to ensure this flexibility isn’t compromised. The solution is a cloud-owned k8s control plane (EKS, GKE, AKS, OKE …) with a few managed add-ons (e.g. CoreDNS, CNI …) and some guidance on how to build apps, while giving the flexibility of introducing new components/add-ons/apps to DevOps/SRE teams. The cost of this flexibility is that these DevOps/SRE teams must now own the lifecycle of the add-ons and the applications that run on top of the k8s infrastructure.

You don’t know what’ll break before it breaks

With so many moving pieces, it’s hard to know if your running k8s components have incompatibilities or latent risks. Many users rely on spreadsheets to track what they are running vs what they should be running, which is both painful and error-prone. We all know that “Not broken != working-as-it-should”. Latent risks and unsupported versions can lurk for weeks or months until they cause impact. What’s needed here is a way to share the collective knowledge of DevOps/SRE teams, so that if one team has encountered an upgrade risk, everyone else gets to avoid it without any extra work on their end. More on this in #8.
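To make the “running vs should be running” comparison concrete, here is a minimal Python sketch of what the spreadsheet is doing by hand. The component names, versions, and the `min_supported` field are hypothetical placeholders, not real compatibility data; a real inventory would come from your clusters and from each project’s support policy.

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v1.9.3' or '1.9.3' into (1, 9, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


@dataclass(frozen=True)
class Component:
    name: str
    running: str        # version currently deployed
    min_supported: str  # lowest version you consider safe (hypothetical data)


def find_stale(components: list[Component]) -> list[str]:
    """Return the names of components running below their minimum supported version."""
    return [
        c.name
        for c in components
        if parse_version(c.running) < parse_version(c.min_supported)
    ]


# Hypothetical inventory, for illustration only.
inventory = [
    Component("coredns", running="1.8.4", min_supported="1.9.3"),
    Component("ingress-controller", running="1.10.0", min_supported="1.9.0"),
]
stale = find_stale(inventory)
print(stale)  # ['coredns']
```

Comparing parsed tuples rather than raw strings matters here: as plain text, "1.10.0" sorts before "1.9.0", which is exactly the kind of quiet mistake a spreadsheet invites.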

Application performance impact is hard to quantify

DevOps/SRE teams need to determine whether the applications’ availability and performance will be impacted by an upgrade. How many times do you find yourself asking: “would my app’s first paint time degrade if I moved to this new kernel version that my cloud provider is recommending?” How many times can you answer this question conclusively?

Stateful sets are pets… in a universe of cattle

While k8s abstractions of running multiple replicas work well for stateless workloads, availability and data integrity of stateful sets can incur all sorts of issues during upgrades. I’ve seen fully-supported upgrade paths leading to downtime, data loss, response time increases, and transaction rate degradation. These issues are hard to debug and introduce cascading failures in downstream applications. Rollbacks are generally your best friend, but….

Rollbacks are a pain

At Amazon we used to have tools and metrics that would quantify performance and availability before, during and after a change. These tools also modeled each change so that a rollback at any step could happen predictably, safely and without requiring expert judgment. A tool like this is missing for k8s. Meaningful canaries to capture change management impact are costly and difficult to operationalize. Modeling steps and stages in the upgrade path remains elusive. Restoring from backups is risky and scary.
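As a rough illustration of a rollback decision that doesn’t need expert judgment, here is a minimal Python sketch of a before/after gate. It is not the Amazon tooling described above; the `Snapshot` fields, the thresholds, and the sample numbers are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    p99_latency_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def should_roll_back(
    before: Snapshot,
    after: Snapshot,
    max_latency_increase: float = 0.10,     # assumed: tolerate up to +10% p99 latency
    max_error_rate_increase: float = 0.01,  # assumed: tolerate up to +1 point of errors
) -> bool:
    """Return True if the post-change snapshot regressed beyond either threshold."""
    latency_limit = before.p99_latency_ms * (1 + max_latency_increase)
    error_limit = before.error_rate + max_error_rate_increase
    return after.p99_latency_ms > latency_limit or after.error_rate > error_limit


# Hypothetical measurements taken before and after one upgrade step.
baseline = Snapshot(p99_latency_ms=200.0, error_rate=0.002)
regressed = Snapshot(p99_latency_ms=250.0, error_rate=0.003)
healthy = Snapshot(p99_latency_ms=205.0, error_rate=0.004)

print(should_roll_back(baseline, regressed))  # True
print(should_roll_back(baseline, healthy))    # False
```

The hard part in practice is not this comparison but getting trustworthy snapshots at every stage of the upgrade, which is where canaries get expensive.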

Components hit end-of-support/end-of-life frequently

K8s has a vibrant ecosystem with thousands of vendors and open-source projects. Everyone is expected to upgrade their k8s infrastructure at least twice every year to get the latest and greatest features and bug-fixes. Similarly, every add-on encourages you to upgrade to a newer version every couple of months, with older versions going end-of-life/end-of-support. Combine it all and you are spending weeks if not months to patch bug-fix releases, track end-of-support/end-of-life events, decide when to upgrade to newer versions, determine which add-on versions work best with your current control plane version, ….
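Tracking end-of-life events is mostly date arithmetic repeated across dozens of components. A minimal Python sketch, assuming you already have an end-of-life date for each version you run; the component names, dates, and the 90-day warning window below are hypothetical.

```python
from datetime import date, timedelta


def eol_status(
    eol: date,
    today: date,
    warn_window: timedelta = timedelta(days=90),  # assumed planning horizon
) -> str:
    """Classify a component version by how close it is to its end-of-life date."""
    if today >= eol:
        return "end-of-life"
    if eol - today <= warn_window:
        return "upgrade-soon"
    return "supported"


# Hypothetical end-of-life dates, for illustration only.
eol_dates = {
    "control-plane": date(2023, 2, 1),
    "cni-plugin": date(2023, 6, 1),
    "dns-addon": date(2024, 1, 1),
}
today = date(2023, 4, 5)
report = {name: eol_status(eol, today) for name, eol in eol_dates.items()}
print(report)
```

Even this toy version shows why the work piles up: every add-on has its own calendar, and the answer changes every day you don’t look at it.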

Getting an upgrade right takes a lot of time

Deloitte’s CIO survey estimates that 80% of DevOps/SRE time is spent in operations/maintenance, and only 20% is spent on innovation. I am not surprised, as cooking up a “safe” upgrade plan is a huge time sink. You have to read an inordinate amount of text and code (release notes, GitHub issues/PRs, blogs, etc.) to really understand what’s relevant to you vs what’s not. This can take weeks of effort, which is time that you could’ve spent on business-critical functions like architectural projects and infrastructure scaling/optimization.


There is no way to learn from and avoid each other’s mistakes

The DevOps/SRE community has a lot of knowledge but there is no way to share it with each other quickly and programmatically. As a result, mistakes are repeated as each team runs into many of the same risks as others in its own upgrade journey. This is a perfect example of what used to be called “undifferentiated heavy lifting” at Amazon, i.e. muck that every developer has to do but that doesn’t provide any visible business gain.

Communication between Platform & App teams isn’t streamlined

Many changes require application changes, e.g. moving apps off of an API that’ll be deprecated with the next upgrade. These changes end up becoming the long pole as DevOps/SRE teams need to inform the app teams, track the fixes as a dependency for their upgrades, and then remind/pester until it’s done. All this back-and-forth causes friction across teams.
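The “which apps block this upgrade” question can at least be turned into a list you hand to app teams. Here is a minimal Python sketch that flags manifests using API versions that no longer exist at a target Kubernetes version. The removal table is a small excerpt (these removals are listed in the Kubernetes deprecated API migration guide), the manifests are hypothetical, and they are represented as plain dicts to keep the example dependency-free.

```python
# (apiVersion, kind) -> Kubernetes minor version in which that API stopped being served.
# Small excerpt only; a real check needs the full list for your target version.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def find_blockers(manifests: list[dict], target: tuple[int, int]) -> list[str]:
    """List the manifests whose API version is gone at the target cluster version."""
    blockers = []
    for manifest in manifests:
        key = (manifest["apiVersion"], manifest["kind"])
        removed = REMOVED_IN.get(key)
        if removed is not None and removed <= target:
            name = manifest["metadata"]["name"]
            blockers.append(f"{manifest['kind']}/{name} uses {manifest['apiVersion']}")
    return blockers


# Hypothetical manifests owned by application teams.
manifests = [
    {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "nightly-report"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
    {"apiVersion": "autoscaling/v2beta2", "kind": "HorizontalPodAutoscaler", "metadata": {"name": "web-hpa"}},
]
blockers = find_blockers(manifests, target=(1, 25))
print(blockers)  # ['CronJob/nightly-report uses batch/v1beta1']
```

Note that the HorizontalPodAutoscaler is not flagged for a 1.25 target but would be for 1.26, so the list has to be regenerated for every upgrade, and someone still has to chase the owners.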

Safe change management is hard. It’s 10x harder for infrastructure

Anyone who’s run infrastructure will tell you that every version bump is an availability risk, and most version bumps end up becoming projects/workflows involving multiple team members, sometimes across different teams. This obsession with safety is critical because the blast radius of an infrastructure risk is really large. For instance, Reddit’s Pi Day outage was a network add-on misconfiguration that led to a 314-minute outage for a team that is deeply obsessed with availability. (The following picture tells the story of how seriously invested this team is in infrastructure availability; read their blog for more details.)

Reddit daily availability vs current SLO target.

Do any of the above pains resonate with you? I’d love to hear about your experiences with Kubernetes upgrades and what kind of challenges you have faced.
