Challenge
The Platform Engineering team at a Fortune 1000 enterprise had built a platform on EKS and GKE, supporting hundreds of software developers running thousands of applications. These applications were mission-critical, as online transactions constituted a sizeable fraction of the company's revenue. The platform ran 30+ complex, open-source add-ons, including Istio/Envoy for Service Mesh, Cilium for Networking & Security, HashiCorp Vault, Gloo, Redis, and Postgres.
GKE and EKS require at least three Kubernetes upgrades every year, and each add-on requires at least one or two upgrades per year. This continuous upgrade treadmill created several challenges for the Platform team:
- 500% increase in cluster costs – GKE and EKS had launched Extended Support, and all clusters not upgraded in time incurred a 6x surcharge.
- Risk of forced upgrades – Falling behind on upgrade cycles could have resulted in forced upgrades—an extreme and disruptive event that posed a business continuity risk.
- Complex add-ons had to be upgraded separately and without application disruption – To contain the blast radius during cluster upgrades, the Platform team needed to upgrade Istio, Cilium, Vault and other add-ons separately from the clusters. The team also had to ensure zero downtime during each upgrade cycle.
- Standardization of a multi-cloud upgrade strategy – GKE and EKS upgrades had diverged into snowflakes, with different team members acting as Subject Matter Experts (SMEs) for each cloud. The Platform team sought to standardize the upgrade process and eliminate dependence on SMEs.
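The two cost figures above describe the same thing: a 6x multiplier on the standard rate is a 500% increase. A minimal sketch of that arithmetic follows. The function names and the baseline fee of 100 are illustrative assumptions, not figures from the customer or from any cloud provider's price list.

```python
def percent_increase(multiplier: float) -> float:
    """Percent increase implied by a cost multiplier (e.g. 6x -> 500%)."""
    return (multiplier - 1.0) * 100.0


def extended_support_cost(standard_cost: float, multiplier: float = 6.0) -> float:
    """Cost of a cluster billed at `multiplier` times the standard rate."""
    return standard_cost * multiplier


# Hypothetical baseline: a cluster fee of 100 (any currency, any period).
standard_fee = 100.0
extended_fee = extended_support_cost(standard_fee)
increase = percent_increase(extended_fee / standard_fee)
print(f"standard={standard_fee}, extended={extended_fee}, increase={increase}%")
```

The same helper applies to any other multiplier; only the 6x value comes from the text above.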
Solution
The Platform team implemented Chkk’s Operational Safety Platform to simplify upgrade management and ensure that upgrades were disruption-free.
- Accelerating Upgrade Process – Chkk automated key upgrade tasks such as dependency analysis, release note processing, and impact assessment across hundreds of add-ons, cutting down research and planning time by up to 8x.
- Upgrade Copilot & Preverified Plans – Chkk’s Upgrade Copilot automated tedious pre-work and delivered Preverified Upgrade Plans for clusters and add-ons. Each plan was tested on a digital twin of the team’s infrastructure to ensure safe, well-orchestrated upgrades. Separate Upgrade Plans for add-ons enabled the team to execute fleetwide upgrades of complex add-ons and soak them for months before performing cluster upgrades.
- Repeatable Upgrades with Curated Workflows – Chkk standardized workflows and enabled task delegation, reducing reliance on expert knowledge and making complex upgrades repeatable and efficient.
- A Standardized, Single Pane-of-Glass for All Upgrades – Chkk Upgrade Copilot became a single pane-of-glass for EKS and GKE upgrades. Customizations codified expert knowledge into the Upgrade Plans and carried it over to future upgrades. All past upgrades were also retained as a system of record, so that upgrade decisions and the collaboration behind them remained accessible to all team members.
- Conformance to Operational Safety Guardrails – The Platform team used Chkk’s Guardrails to update hundreds of Helm charts owned by application teams, ensuring conformance to safety primitives at the source of their software development lifecycle.
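The sequencing described above (upgrade add-ons fleetwide, let them soak, then upgrade clusters) can be expressed as a simple readiness gate. The sketch below is illustrative only and is not Chkk's implementation. The `AddonUpgrade` type, the `cluster_upgrade_ready` function, and the 60-day soak period are all assumptions; the source says only that add-ons soaked "for months".

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class AddonUpgrade:
    name: str
    completed_on: Optional[date]  # None means the add-on is not yet upgraded


def cluster_upgrade_ready(
    addons: List[AddonUpgrade],
    today: date,
    min_soak: timedelta = timedelta(days=60),  # assumed soak period
) -> bool:
    """True only when every add-on is upgraded and has soaked long enough."""
    return all(
        a.completed_on is not None and today - a.completed_on >= min_soak
        for a in addons
    )


# Hypothetical fleet state.
addons = [
    AddonUpgrade("istio", date(2024, 1, 10)),
    AddonUpgrade("cilium", date(2024, 2, 1)),
]
print(cluster_upgrade_ready(addons, date(2024, 3, 1)))   # cilium has soaked 29 days
print(cluster_upgrade_ready(addons, date(2024, 4, 15)))  # both have soaked 60+ days
```

Gating on the slowest add-on keeps the cluster upgrade from overlapping with any add-on change, which is the blast-radius concern raised in the Challenge section.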
"Chkk has transformed our Kubernetes upgrade strategy, eliminating costly delays and manual overhead while ensuring a standard multi-cloud upgrade process that works across EKS and GKE. With Chkk, we’ve not only reduced upgrade costs but also improved business continuity by proactively preventing forced upgrades and disruptions." — Director of Infrastructure
Outcomes
By implementing Chkk, the Platform team at this Fortune 1000 enterprise achieved significant operational and financial benefits:
- Avoided 6x cost increase in Extended Support costs.
- 200% increase in upgrade productivity, ensuring business, regulatory, and compliance goals were met.
- 80% reduction in upgrade preparation time, eliminating weeks of manual research and validation.
- 2 FTEs Repurposed for High-Value Work – With Chkk handling upgrade complexity, skilled engineers were freed from routine, manual tasks and could focus on strategic initiatives.
- Improved operational efficiency – The Platform team could focus on strategic initiatives rather than break-fix efforts.
- Eliminated Upgrade Bottlenecks and Knowledge Silos – Chkk enabled multiple team members to take ownership of complex add-on upgrades, breaking reliance on a handful of experts and allowing work to be parallelized.
- Standardization of Operational Safety Guardrails – Chkk’s Guardrails were used to define a “conformance standard” that all application teams adopted, making safety a key primitive throughout the software development lifecycle.
"Before Chkk, I was the bottleneck for every complex add-on upgrade, stuck in a cycle of manual work and firefighting. With Chkk’s workflows, I could finally delegate upgrades to other team members who were eager to dive into the challenges. We parallelized upgrades across the team, making the process faster, more efficient, and no longer dependent on a few experts." — Principal Platform Engineer.
Takeaways
- Frequent Kubernetes Releases Demand Proactive Upgrade Management – With EKS and GKE each requiring at least three Kubernetes upgrades a year, teams must adopt automation to stay ahead of deprecations, avoid forced upgrades, and prevent costly disruptions.
- Add-On Complexity Requires an Out-of-Band, Fleetwide Upgrade Strategy – Upgrading Kubernetes is not just about the control plane; mission-critical add-ons like Istio, Cilium, Keycloak, and Kafka require separate upgrade workflows to minimize risk and maintain stability.
- Breaking Bottlenecks Enables Scale and Efficiency – Chkk empowered more team members to take on complex upgrades, eliminating reliance on a few experts and allowing work to be parallelized for faster execution.
- Automation Turns Upgrades from a Burden into a Competitive Advantage – By streamlining upgrades, reducing manual effort, and ensuring smooth transitions, Chkk helped the team focus on innovation rather than firefighting infrastructure changes.