Our last 100+ conversations with DevOps/SREs can be summarized in 4 nouns and 3 emotions: “Kubernetes Cluster Version Upgrades”… “Hard, Pain, Work”. Why are Kubernetes upgrades so challenging? Why isn’t a Kubernetes upgrade as easy as upgrading an iPhone? Here’s what makes it hard, and why DevOps/SREs find change management stressful.
Let’s dive into each.
1. K8s is designed for flexibility, and cloud providers work hard to ensure this flexibility isn’t compromised. The solution is a cloud-owned k8s control plane (EKS, GKE, AKS, OKE …) with a few managed add-ons (e.g. CoreDNS, CNI …) and some guidance on how to build apps, while leaving DevOps/SRE teams the flexibility to introduce new components/add-ons/apps. The cost of this flexibility is that those DevOps/SRE teams must now own the lifecycle of the add-ons and the applications that run on top of the k8s infrastructure.
2. With so many moving pieces, it’s hard to know whether your running k8s components have incompatibilities or latent risks. Many teams use spreadsheets to track what they are running vs. what they should be running, which is both painful and error-prone. We all know that “not broken != working-as-it-should”. Latent risks and unsupported versions can lurk for weeks or months until they cause impact. What’s needed here is sharing the collective knowledge of DevOps/SRE teams, so that if one team has encountered an upgrade risk, everyone else gets to avoid it without any extra work on their end. More on this in #8.
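To make the spreadsheet problem concrete, here is a minimal sketch of what that tracking could look like as a script instead: it compares the image tags of add-ons running in kube-system against a hand-maintained “desired” map. The add-on names and version tags below are hypothetical placeholders, not an official compatibility matrix, and the sketch assumes the official kubernetes Python client with kubeconfig access.

```python
from kubernetes import client, config

# Hypothetical "known good for our control plane version" map -- the values
# here are placeholders, not an official compatibility matrix.
DESIRED = {
    "coredns": "v1.10.1",
    "aws-node": "v1.15.0",     # example CNI DaemonSet name/tag
    "kube-proxy": "v1.27.6",
}

def running_addon_versions(namespace="kube-system"):
    """Return {workload name: image tag} for Deployments and DaemonSets."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    workloads = (apps.list_namespaced_deployment(namespace).items
                 + apps.list_namespaced_daemon_set(namespace).items)
    versions = {}
    for w in workloads:
        # Take the first container's image tag as the add-on version.
        image = w.spec.template.spec.containers[0].image
        versions[w.metadata.name] = image.rpartition(":")[2]
    return versions

def drift_report():
    running = running_addon_versions()
    for addon, want in DESIRED.items():
        have = running.get(addon, "<not found>")
        status = "OK   " if have == want else "DRIFT"
        print(f"{status} {addon}: running={have} desired={want}")

if __name__ == "__main__":
    drift_report()
```

Even a crude report like this beats a spreadsheet, because it reflects what is actually deployed rather than what someone remembered to type in.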
3. DevOps/SRE teams need to determine whether their applications’ availability and performance will be impacted by an upgrade. How many times do you find yourself asking: “would my app’s first paint time degrade if I moved to this new kernel version that my cloud provider is recommending?” How many times can you answer that question conclusively?
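As a rough illustration of the check I wish were routine, here is a sketch that compares a latency percentile an hour before and after an upgrade using the Prometheus HTTP API. The endpoint URL and metric name are placeholders for whatever your environment actually exposes, and a real rollout gate would need far more than a single percentile.

```python
import requests

PROM = "http://prometheus.example.internal:9090"   # placeholder endpoint
# Placeholder metric: p99 request latency from a histogram your apps may not expose.
QUERY = ('histogram_quantile(0.99, '
         'sum(rate(http_request_duration_seconds_bucket[1h])) by (le))')

def p99_at(ts_unix):
    """Evaluate the p99 query at a given unix timestamp."""
    resp = requests.get(f"{PROM}/api/v1/query",
                        params={"query": QUERY, "time": ts_unix})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

def before_after(upgrade_ts, budget_pct=10.0):
    """Compare p99 one hour before vs. one hour after the upgrade timestamp."""
    before = p99_at(upgrade_ts - 3600)
    after = p99_at(upgrade_ts + 3600)
    if before is None or after is None:
        return "insufficient data"
    change_pct = (after - before) / before * 100
    verdict = "within budget" if change_pct <= budget_pct else "REGRESSION"
    return f"p99 {before:.3f}s -> {after:.3f}s ({change_pct:+.1f}%): {verdict}"
```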
4. While k8s abstractions of running multiple replicas work well for stateless workloads, stateful sets can run into all sorts of availability and data-integrity issues during upgrades. I’ve seen fully supported upgrade paths lead to downtime, data loss, response-time increases, and transaction-rate degradation. These issues are hard to debug and introduce cascading failures in downstream applications. Rollbacks are generally your best friend, but….
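One small guardrail that helps before node rotations is checking that every StatefulSet is actually covered by a PodDisruptionBudget. Below is a minimal sketch of that check, assuming the official kubernetes Python client and PDBs that use simple match_labels selectors; it is a pre-flight hint, not a guarantee of data safety.

```python
from kubernetes import client, config

def unprotected_statefulsets():
    """Return StatefulSets whose pod labels are not covered by any PDB."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    policy = client.PolicyV1Api()
    pdbs = policy.list_pod_disruption_budget_for_all_namespaces().items
    flagged = []
    for sts in apps.list_stateful_set_for_all_namespaces().items:
        pod_labels = sts.spec.template.metadata.labels or {}
        covered = any(
            pdb.metadata.namespace == sts.metadata.namespace
            and pdb.spec.selector is not None
            and pdb.spec.selector.match_labels
            # Simple subset match; match_expressions are ignored in this sketch.
            and all(pod_labels.get(k) == v
                    for k, v in pdb.spec.selector.match_labels.items())
            for pdb in pdbs
        )
        if not covered:
            flagged.append(f"{sts.metadata.namespace}/{sts.metadata.name}")
    return flagged

if __name__ == "__main__":
    for name in unprotected_statefulsets():
        print(f"No PodDisruptionBudget covers {name}")
```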
5. At Amazon, we had tools and metrics that quantified performance and availability before, during, and after a change. These tools also modeled each change so that a rollback at any step could happen predictably, safely, and without requiring expert judgment. Tooling like this is missing for k8s. Meaningful canaries that capture change-management impact are costly and difficult to operationalize. Modeling the steps and stages in an upgrade path remains elusive. Restoring from backups is risky and scary.
6. K8s has a vibrant ecosystem with thousands of vendors and open-source projects. Everyone is expected to upgrade their k8s infrastructure at least twice a year to get the latest features and bug fixes. Similarly, every add-on encourages you to upgrade to a newer version every couple of months, with older versions going end-of-life/end-of-support. Combine it all and you are spending weeks, if not months, patching bug-fix releases, tracking end-of-support/end-of-life events, deciding when to upgrade to newer versions, determining which add-on versions work best with your current control plane version, and so on.
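A small slice of that bookkeeping can at least be automated. Here is a hedged sketch that compares each node’s kubelet minor version against the control plane and flags anything outside a configurable skew budget; the budget below is an assumption, so check it against the Kubernetes version-skew policy for the releases you actually run.

```python
from kubernetes import client, config

MAX_KUBELET_SKEW = 2   # assumed budget -- verify against the version-skew policy

def minor(version_string):
    """Extract the minor version, e.g. 'v1.27.6-eks-abc123' -> 27."""
    return int(version_string.lstrip("v").split(".")[1])

def skew_report():
    config.load_kube_config()
    control_plane = minor(client.VersionApi().get_code().git_version)
    for node in client.CoreV1Api().list_node().items:
        kubelet = minor(node.status.node_info.kubelet_version)
        skew = control_plane - kubelet
        flag = "OK   " if 0 <= skew <= MAX_KUBELET_SKEW else "CHECK"
        print(f"{flag} {node.metadata.name}: "
              f"kubelet 1.{kubelet} vs control plane 1.{control_plane}")

if __name__ == "__main__":
    skew_report()
```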
7. Deloitte’s CIO survey estimates that 80% of DevOps/SRE time is spent on operations and maintenance, and only 20% on innovation. I’m not surprised, as cooking up a “safe” upgrade plan is a huge time sink. You have to read an inordinate amount of text and code (release notes, GitHub issues/PRs, blogs, etc.) to understand what’s relevant to you and what’s not. This can take weeks of effort, which is time you could have spent on business-critical work like architectural projects and infrastructure scaling/optimization.
8. The DevOps/SRE community has a lot of knowledge, but there is no way to share it quickly and programmatically. As a result, mistakes are repeated, with each team hitting many of the same risks others already have on its own upgrade journey. This is a perfect example of what used to be called “undifferentiated heavy lifting” at Amazon: muck that every developer has to do but that doesn’t provide any visible business gain.
9. Many upgrades require application changes, e.g. moving apps off an API that will be deprecated in the next upgrade. These changes end up becoming the long pole: DevOps/SRE teams need to inform the app teams, track the fixes as a dependency for their own upgrades, and then remind and pester until the work is done. All this back-and-forth causes friction across teams.
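Finding those apps early is mostly mechanical. The sketch below walks a directory of manifests and flags resources whose apiVersion is removed at or before a target minor release. The removal table is a small illustrative subset; the authoritative source is the Kubernetes deprecated-API migration guide, and tools like Pluto or kube-no-trouble do this more thoroughly.

```python
import sys
from pathlib import Path
import yaml

# (apiVersion, kind) -> minor release where it was removed.
# Illustrative subset only; consult the deprecated-API migration guide.
REMOVED_IN = {
    ("networking.k8s.io/v1beta1", "Ingress"): 22,
    ("batch/v1beta1", "CronJob"): 25,
    ("policy/v1beta1", "PodDisruptionBudget"): 25,
}

def flag_removed_apis(manifest_dir, target_minor):
    """Print resources in *.yaml files that won't survive the target release."""
    for path in Path(manifest_dir).rglob("*.yaml"):
        for doc in yaml.safe_load_all(path.read_text()):
            if not isinstance(doc, dict):
                continue
            removed = REMOVED_IN.get((doc.get("apiVersion"), doc.get("kind")))
            if removed is not None and target_minor >= removed:
                name = doc.get("metadata", {}).get("name", "<unnamed>")
                print(f"{path}: {doc['kind']} '{name}' uses {doc['apiVersion']}, "
                      f"removed in 1.{removed}")

if __name__ == "__main__":
    # e.g. python find_removed_apis.py ./manifests 27
    flag_removed_apis(sys.argv[1], int(sys.argv[2]))
```

Running something like this as part of planning turns “pester the app teams” into a concrete, trackable list of resources and owners.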
10. Anyone who has run infrastructure will tell you that every version bump is an availability risk, and most version bumps end up becoming projects/workflows involving multiple team members, sometimes across different teams. This obsession with safety is critical because the blast radius of an infrastructure risk is really large. For instance, Reddit’s Pi Day outage was a network add-on misconfiguration that led to a 314-minute outage for a team that is deeply obsessed with availability. (The following picture shows how seriously invested this team is in infrastructure availability; read their blog for more details.)
Do any of the above pains resonate with you? I’d love to hear about your experiences with Kubernetes upgrades and what kind of challenges you have faced.