Challenge
The Platform Engineering team at this Fortune 500 enterprise ran a Kubernetes platform on EKS that supported hundreds of software developers running thousands of applications. These applications were mission-critical because online transactions accounted for a sizeable fraction of the company's revenue; conservative estimates put the cost of downtime at $3.2K per minute (roughly $200K per hour).
EKS requires at least three Kubernetes upgrades every year, and each add-on requires at least one or two upgrades per year. This posed the following challenges for the Platform team:
- 500% increase in cluster costs – EKS had launched Extended Support, and every cluster in Extended Support was billed at 6x the standard rate, which amounted to approximately $500K in additional costs per year.
- Risk of forced upgrades – Falling behind on upgrade cycles could have resulted in forced upgrades, an extreme and disruptive event that posed a business continuity risk.
- Complex add-ons had to be upgraded separately and without disruption – To contain the blast radius during cluster upgrades, the Platform team needed to upgrade Contour, Envoy, Prometheus, Nginx, Redis, Kong, and other complex add-ons separately from the clusters.
- Extensive coordination required with application teams – The Platform team needed to regularly communicate upgrade readiness with application teams, ensuring that they migrated off removed APIs with each upgrade cycle. This was a multi-step coordination process owned by a dedicated Technical Project Manager (TPM) and involving engineers, Directors, and VPs from both the Platform and application teams.
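The removed-API check that drives this coordination can be sketched in a few lines. This is a minimal illustration, not Chkk's implementation: the `REMOVED_IN` table is a small subset of removals taken from the upstream Kubernetes deprecation guide, and the `inventory` records (team names, resource names) are hypothetical.

```python
# Illustrative subset of upstream Kubernetes API removals.
# Maps (apiVersion, kind) -> first Kubernetes minor version that no longer serves it.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def blocking_resources(resources, target_version):
    """Return the resources whose API is no longer served at target_version."""
    blocked = []
    for res in resources:
        removed = REMOVED_IN.get((res["apiVersion"], res["kind"]))
        # Tuples compare element by element, so (1, 25) >= (1, 25) is True.
        if removed is not None and target_version >= removed:
            blocked.append(res)
    return blocked


# Hypothetical inventory of resources owned by application teams.
inventory = [
    {"team": "payments", "name": "nightly-report",
     "apiVersion": "batch/v1beta1", "kind": "CronJob"},
    {"team": "checkout", "name": "web",
     "apiVersion": "apps/v1", "kind": "Deployment"},
    {"team": "search", "name": "api-hpa",
     "apiVersion": "autoscaling/v2beta2", "kind": "HorizontalPodAutoscaler"},
]

for res in blocking_resources(inventory, (1, 25)):
    print(f'{res["team"]}: {res["kind"]}/{res["name"]} '
          f'uses removed API {res["apiVersion"]}')
```

Grouping the result by team gives each application owner a concrete list of what to migrate before the cluster upgrade can proceed.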
Solution
The Platform team implemented Chkk’s Operational Safety Platform to simplify upgrade management and ensure that the upgrades were disruption-free.
- Accelerating Upgrade Process – Chkk automated key upgrade tasks such as dependency analysis, release note processing, and impact assessment across hundreds of add-ons, cutting down research and planning time by up to 8x.
- Upgrade Copilot & Preverified Plans – Chkk’s Upgrade Copilot automated tedious pre-work and delivered Preverified Upgrade Plans for clusters and add-ons. Each plan was tested on a digital twin of the enterprise’s infrastructure to ensure safe, well-orchestrated upgrades. Separate Upgrade Plans for add-ons enabled the team to execute fleetwide upgrades of complex add-ons and soak them for months before performing cluster upgrades.
- Repeatable Upgrades with Curated Workflows – Chkk standardized workflows and enabled task delegation, reducing reliance on expert knowledge and making complex upgrades repeatable and efficient.
- Streamlining Communications with Application Teams – Chkk highlighted upgrade risks and considerations that could have caused application disruptions or stalled the upgrade. Each risk came with a Knowledge Base Article (KBA) detailing the risk, its severity and impact, which resources were impacted, and how to mitigate it. Each risk included its own workflows for notification and risk lifecycle management, improving coordination and reducing friction between the Platform and application teams.
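The "upgrade add-ons first, soak, then upgrade the cluster" sequencing described above can be expressed as a simple gate. The sketch below is hypothetical and is not how Chkk models Upgrade Plans: the `AddonStatus` fields, the 60-day soak threshold, and all dates are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class AddonStatus:
    name: str
    supports_target: bool  # deployed version supports the target Kubernetes version
    upgraded_on: date      # date the deployed version finished rolling out fleetwide


def cluster_upgrade_blockers(addons, today, min_soak=timedelta(days=60)):
    """List the reasons a cluster upgrade should wait; empty means go."""
    blockers = []
    for addon in addons:
        if not addon.supports_target:
            blockers.append(
                f"{addon.name}: deployed version does not support the target version")
        elif today - addon.upgraded_on < min_soak:
            soaked = (today - addon.upgraded_on).days
            blockers.append(f"{addon.name}: soaked {soaked} of {min_soak.days} days")
    return blockers


# Hypothetical fleet state; names come from the add-ons listed earlier.
fleet = [
    AddonStatus("Contour", True, date(2025, 3, 1)),
    AddonStatus("Kong", True, date(2025, 5, 20)),
    AddonStatus("Redis", False, date(2025, 1, 10)),
]

for reason in cluster_upgrade_blockers(fleet, today=date(2025, 6, 1)):
    print(reason)
```

Keeping the gate as a pure function of fleet state makes it easy to run on a schedule and to share the resulting blocker list with the teams that own each add-on.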
"Managing Kubernetes upgrades at this scale was becoming unsustainable, with rising costs, complex add-on dependencies, and constant coordination challenges. Chkk gave us a structured, automated approach that not only kept us ahead of EKS releases but also eliminated nearly $500K in unnecessary costs while ensuring business continuity." — VP of Infrastructure
Outcomes
By implementing Chkk, this Fortune 500 enterprise achieved significant operational and financial benefits:
- Avoided $403K per year in Extended Support costs (up to $500K per year).
- 200% increase in upgrade productivity, ensuring business, regulatory, and compliance goals were met.
- 80% reduction in upgrade preparation time, eliminating weeks of manual research and validation.
- 4 FTEs repurposed for high-value work – With Chkk handling upgrade complexity, skilled engineers were freed from routine, manual tasks.
- Improved operational efficiency – The Platform team could focus on strategic initiatives rather than break-fix efforts.
- Seamless upgrade communications between Platform and application teams – Chkk enabled clear, automated risk communication, reducing friction and ensuring smooth, disruption-free upgrades.
"Before Chkk, upgrades were a painstaking process, requiring constant coordination with application teams and careful sequencing of add-on upgrades. With Chkk, we streamlined workflows, automated risk assessments, and improved communication—allowing us to execute upgrades with confidence and reclaim valuable engineering time." — Director, Platform Engineering.
Takeaways
- Frequent Kubernetes Releases Demand Proactive Upgrade Management – With EKS accelerating its release cycles, teams must adopt automation to stay ahead of deprecations, avoid forced upgrades, and prevent costly disruptions.
- Add-On Complexity Requires an Out-of-Band, Fleetwide Upgrade Strategy – Upgrading Kubernetes is not just about the control plane; mission-critical add-ons like Istio, Cilium, Keycloak, and Kafka require separate upgrade workflows to minimize risk and maintain stability.
- Clear Communication Reduces Upgrade Friction – Proactively identifying risks and providing structured, automated communication workflows between the Platform and application teams streamlines upgrades and minimizes disruptions.
- Automation Increases Efficiency and Reduces Costs – By automating upgrade planning, dependency analysis, and validation, Chkk helped the team reduce manual effort, repurpose skilled engineers for strategic work, and avoid nearly $500K in Extended Support costs.
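The cost figures quoted in this case study can be sanity-checked with some quick arithmetic. The sketch below assumes AWS's published per-cluster control-plane prices of $0.10 per hour in standard support and $0.60 per hour in Extended Support; those prices are not stated in the case study. The implied cluster count is a back-of-envelope estimate, not a figure reported by the customer.

```python
# Prices in cents per cluster-hour (assumed from AWS's published EKS pricing).
STANDARD_CENTS = 10
EXTENDED_CENTS = 60
HOURS_PER_YEAR = 24 * 365  # 8760

# A 6x price is the same thing as a 500% increase.
price_ratio = EXTENDED_CENTS / STANDARD_CENTS
percent_increase = (EXTENDED_CENTS - STANDARD_CENTS) / STANDARD_CENTS * 100

# Additional cost of keeping one cluster in Extended Support for a year, in dollars.
extra_per_cluster_year = (EXTENDED_CENTS - STANDARD_CENTS) * HOURS_PER_YEAR / 100

# Rough fleet size implied by ~$500K per year of additional cost (an estimate only).
implied_clusters = 500_000 / extra_per_cluster_year

# Downtime cost: $3.2K per minute is $192K per hour, i.e. roughly $200K.
downtime_per_hour = 3_200 * 60

print(f"price ratio: {price_ratio:.0f}x ({percent_increase:.0f}% increase)")
print(f"extra cost per cluster-year: ${extra_per_cluster_year:,.0f}")
print(f"implied clusters in Extended Support: about {implied_clusters:.0f}")
print(f"downtime cost per hour: ${downtime_per_hour:,}")
```

Under these assumptions, an additional $500K per year corresponds to a fleet on the order of a hundred clusters running in Extended Support year-round.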