Platform Engineering
December 6, 2023

Platform teams need a delightfully different approach, not one that sucks less

Written by
Fawad Khaliq
Ali Khayam

Platform Engineering is emerging as the de facto method to deliver cloud-as-a-service to enterprise teams. Gartner estimates that by 2026, 80% of software engineering organizations will have established platform teams as internal providers of reusable services, components, and tools for application delivery. While this trend sounds elegant and straightforward, the reality is anything but.

The challenges that platform teams experience can be broadly classified into five buckets.

Challenge #1: Shared responsibility pushes complexity to platform teams.

Cloud providers are responsible for a small subset of the platform stack. Remember when cloud providers promised to eliminate all the muck with vertically-integrated infrastructure services so we could all disregard software versions, component compatibilities, testing/verification, and so on?

Then along came Kubernetes, a tool designed for flexibility that can’t, or at least shouldn’t, be vertically integrated. The result is a cloud-owned Kubernetes control plane (like EKS, GKE, AKS, and OKE) with a few managed add-ons (like CoreDNS, CNI, and kube-proxy) and some guidance on how to build apps. While this approach gives platform teams the flexibility to introduce new components, add-ons, and apps, this flexibility comes at a steep cost. Platform teams have found themselves responsible for the lifecycle of the add-ons and the applications that run on top of Kubernetes. This is commonly called the “shared responsibility model,” in which customers are expected to keep clusters up to date.

Application teams share the responsibility of keeping apps up to date, which causes friction and wastes time. In general, the slowest-moving items on a platform team’s plate are the ones that require application changes. Kubernetes is a vibrant ecosystem where APIs and software versions are regularly deprecated, and getting app teams to prioritize the resulting changes on their side leads to friction and delays. As a result, platform teams don’t get enough time to test new changes, resulting in errors, disruptions, and failures, especially during upgrades.

Consequence: Stability (aka the status quo) wins over velocity and innovation. In this shared responsibility world, platform teams are now sandwiched between the cloud infrastructure and the application business logic with an impossible job: deliver better-than-yesterday features and scale to application teams, and make sure things never break. This disincentivizes velocity and innovation, as these teams pick stability nine times out of ten due to the friction and anxieties stemming from the shared responsibility model.

Challenge #2: Platforms are already complex… complexity is increasing monotonically.

Kubernetes is complex, yet flexible… but this great flexibility causes great pain. CNCF has an incredibly rich ecosystem of components and systems–1,000+ components, 150+ hosted projects, and 178,000+ contributors from 189 countries. It’s safe to say that CNCF has something for everyone.


This unprecedented flexibility comes with inordinate complexity. A typical cluster has at least 20 open-source add-ons with intricate dependencies and unique release cycles that must be managed by platform teams. A typical enterprise runs hundreds of clusters, with some already exceeding thousands of clusters. Throw in at least five major cloud providers and Kubernetes distributions and the complexity of even retrieving simple information increases exponentially.

Consequence: platform teams repeat undifferentiated heavy lifting to answer mundane questions. Every platform team on the planet rebuilds a spreadsheet, a wiki, a Slack channel, and some custom scripts to answer simple questions like: What Kubernetes versions are running on my clusters? How many versions of a particular add-on am I running and where? When is the next upgrade of a cluster due? Which applications will be impacted by this upgrade?
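To make the "mundane questions" concrete, here is a minimal sketch of the kind of script teams end up writing. It assumes the inventory has already been collected into a list of records; the cluster names, version numbers, and the `INVENTORY` structure are all hypothetical, and a real script would gather this data from each cluster's API.

```python
from collections import defaultdict

# Hypothetical inventory records (illustrative values only). In practice
# these would be collected from each cluster rather than typed by hand.
INVENTORY = [
    {"cluster": "prod-us", "k8s": "1.27",
     "addons": {"coredns": "1.10.1", "cert-manager": "1.12.0"}},
    {"cluster": "prod-eu", "k8s": "1.26",
     "addons": {"coredns": "1.9.3", "cert-manager": "1.12.0"}},
    {"cluster": "staging", "k8s": "1.27",
     "addons": {"coredns": "1.10.1"}},
]

def clusters_by_k8s_version(inventory):
    """Answer: what Kubernetes versions are running on my clusters?"""
    groups = defaultdict(list)
    for record in inventory:
        groups[record["k8s"]].append(record["cluster"])
    return dict(groups)

def addon_versions(inventory, addon):
    """Answer: how many versions of one add-on am I running, and where?"""
    groups = defaultdict(list)
    for record in inventory:
        version = record["addons"].get(addon)
        if version is not None:  # skip clusters that do not run the add-on
            groups[version].append(record["cluster"])
    return dict(groups)
```

The queries themselves are trivial. The heavy lifting is in collecting and refreshing the inventory across hundreds of clusters and several cloud providers, which this sketch deliberately leaves out.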

This undifferentiated heavy lifting takes platform teams away from answering the harder questions: Are all my add-ons and control plane components compatible with each other? Is the platform running any unsupported software versions? (After all, most open-source components just say “we only support the latest.”)
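The compatibility question can be sketched as a lookup against a support matrix. The example below is illustrative only: the add-on names, versions, and `SUPPORT_MATRIX` entries are invented, and building an accurate matrix from release notes is exactly the hard part the sketch skips.

```python
# Hypothetical support matrix: (add-on, version) -> Kubernetes minor versions
# it was tested against. All entries are invented for illustration.
SUPPORT_MATRIX = {
    ("example-ingress", "2.1"): {"1.25", "1.26", "1.27"},
    ("example-ingress", "1.8"): {"1.23", "1.24"},
    ("example-dns", "3.0"): {"1.26", "1.27", "1.28"},
}

def find_incompatibilities(k8s_version, addons, matrix=SUPPORT_MATRIX):
    """Return (addon, version, reason) tuples for add-ons that are not
    known to support the given Kubernetes minor version."""
    problems = []
    for name, version in sorted(addons.items()):
        supported = matrix.get((name, version))
        if supported is None:
            # Unknown is treated as a finding, not as "probably fine".
            problems.append((name, version, "no support data"))
        elif k8s_version not in supported:
            problems.append((name, version, "not tested with " + k8s_version))
    return problems
```

Treating missing data as a finding is a deliberate choice here: an add-on with no published support window is precisely the "we only support the latest" case described above.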

Challenge #3: Change is a constant… and all changes are availability risks.

There are too many drivers for change, yet only one team can effect change. There are at least four drivers that require platform teams to make changes regularly:

  1. Security and compliance teams expect that vulnerabilities are fixed within their prescribed SLAs.
  2. Application teams expect scaling, new feature delivery, and performance enhancements to happen within their own software delivery timelines.
  3. Cloud providers expect every cluster to be upgraded at least twice per year. 
  4. Add-on vendors release updates three to four times per year to keep up with the ecosystem. 

All this inflow of change must be ingested, prioritized, and executed by the platform team, as most of the above teams never talk to each other.
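A rough back-of-the-envelope estimate shows how quickly this inflow adds up. The sketch below uses the figures quoted in this article (at least 20 add-ons per cluster, three to four add-on releases per year, two cluster upgrades per year) and assumes, for simplicity, that every add-on release is adopted on every cluster. It ignores security patches and application-driven changes, so it is a lower bound under those assumptions rather than a measurement.

```python
def yearly_change_events(clusters, addons_per_cluster=20,
                         addon_releases_per_year=3, k8s_upgrades_per_year=2):
    """Estimate planned change events per year across a fleet.

    Counts Kubernetes upgrades plus add-on releases only; the default
    values are the low end of the ranges quoted in the article.
    """
    per_cluster = (k8s_upgrades_per_year
                   + addons_per_cluster * addon_releases_per_year)
    return clusters * per_cluster

# One cluster: 2 upgrades + 20 add-ons * 3 releases = 62 changes per year.
print(yearly_change_events(1))
# A fleet of 100 clusters (an assumed fleet size): 6200 changes per year.
print(yearly_change_events(100))
```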

The biggest changes are Kubernetes upgrades, which are never simple. Kubernetes upgrades happen on every cluster at least twice per year and require months of prework and planning. Business leaders often wonder why Kubernetes upgrades can’t be like an iOS upgrade–tap a button, get a new version, and everything works. We detailed why that will never be possible with a Kubernetes ecosystem in this blog post.

Consequence: because changes cause disruptions, implementing changes takes forever. The easiest way to avoid breaking things is to do nothing, which is exactly what most teams end up doing. They know that even simple tasks will take a lot of work and might still cause failures and disruptions, so they take the “better safe than sorry” approach. Nobody wants to write yet another postmortem… As a result, the only tasks that get done are the mandated ones, e.g., fixing critical security vulnerabilities, upgrading Kubernetes versions that are going out of support, and so on.

Challenge #4: Teams can’t automate and hire fast enough to keep up with platform growth and support mission-critical applications.

Even as your infrastructure scales, growing headcount is not sustainable. Growing headcount proportionally with your infrastructure works against the charter and vision of platform teams. Automation is the answer, but the business always prioritizes maintenance tasks and firefights, leaving no time to undertake major automation projects.

Pinning your hopes on hiring is a losing battle. Even if you can get the headcount, platform talent is incredibly scarce. Recent studies show that two of your directly needed roles are in the top three most wanted software jobs: Cloud Engineer and DevOps Engineer. So hiring is a hope, not a strategy—especially if you aren’t a Big Tech company. 

Training is too expensive and time-consuming. If you hire less experienced team members, most of the training happens on the job, which takes away precious time from your best engineers. 

The best team members are entangled in every firefight and every critical path task. A shortage of skilled engineers means that the existing team members with the best judgment get pulled into every task. Most of these are repeated tasks like reading release notes, following and engaging with open-source communities, and talking to cloud providers’ support, product, and engineering teams. This operational overhead leaves no time for innovation, architectural improvements, or strategic thinking, which were supposed to be the main job for these team members. Before you know it, your best engineers become overworked and unhappy single points of failure in the organization.

Consequence: Seeing unhappiness and fatigue in your team forces you to hire contractors, which has its own set of challenges and considerations.

Challenge #5: Reactive incident response is necessary but insufficient.

Platform teams today depend on observability, monitoring, and alerting systems to reduce response latency, make firefighting more efficient, and minimize a failure’s impact. Unfortunately, this leads to a myriad of problems.

This approach means that any potential mitigations first require experiencing the pain of an incident. Furthermore, a human has to invest their time to sift through a lot of information online to understand if someone else has had the same failure and, if so, how they fixed it. The whole process depends on an incredible amount of manual work and engineering resources, without being able to guarantee similar errors and disruptions won’t occur again.

Consequence: Firefights are a way of life and automation always takes a back seat.

There has to be a better way…

It seems impossible for a single company’s platform team to solve these chronic challenges, but we believe it’s possible if we enable platform teams to “collectively learn” from each other, tap into each other’s wisdom, and avoid each other’s mistakes. A technological solution to achieve this goal should ensure that:

  • Platform engineers can learn from the unstructured information available on the internet without having to do superhuman things like reading walls and walls of text just to update a single component in the infrastructure, tracking versions through CLIs and APIs, talking to GitHub project maintainers, etc. 
  • Silos are broken across different platform teams so that learnings (risks, post-mortems, root causes, etc.) are seamlessly and programmatically shared without any extra effort on any one team’s part. This problem has been solved in other domains–e.g. CVE is a single authoritative source that allows all security teams to collaborate and learn from each other. We need a “CVE for Availability”.  

Solving these challenges requires a “trusted broker” of information that can collect information from all the users as well as public sources on the internet, validate that this information is relevant and accurate, curate this information as programmatic signatures, and publish it broadly for everyone’s benefit.
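To illustrate what a "programmatic signature" could look like, here is a minimal sketch. It is not Chkk's actual format: the `AvailabilityRisk` class, its fields, the `AR-0001` identifier, and the example component are all hypothetical, chosen only to show how a curated failure report could be matched against a cluster's inventory before an incident occurs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AvailabilityRisk:
    """A hypothetical signature describing a known failure condition."""
    risk_id: str
    summary: str
    component: str
    affected_versions: frozenset  # component versions known to be affected
    k8s_versions: frozenset       # Kubernetes minors where the failure occurs

    def matches(self, cluster):
        """True if the cluster runs an affected version on an affected minor."""
        version = cluster["addons"].get(self.component)
        return (version in self.affected_versions
                and cluster["k8s"] in self.k8s_versions)

def scan(cluster, signatures):
    """Return the IDs of all signatures that match one cluster's inventory."""
    return [s.risk_id for s in signatures if s.matches(cluster)]

# Invented example: one team's post-mortem, encoded once and shared.
SIGNATURES = [
    AvailabilityRisk(
        risk_id="AR-0001",
        summary="example-dns 3.0 crash-loops after upgrading to 1.28",
        component="example-dns",
        affected_versions=frozenset({"3.0"}),
        k8s_versions=frozenset({"1.28"}),
    ),
]

print(scan({"k8s": "1.28", "addons": {"example-dns": "3.0"}}, SIGNATURES))
```

The point of the structure is that a lesson learned by one team becomes a check every other team can run automatically, much as a CVE entry lets a scanner flag a vulnerable package.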

Chkk’s mission is to bring such a trusted broker to platform teams.
