Technology
November 15, 2023

Kubernetes Enters Its Second Decade: Insights from KubeCon Chicago

Written by
Fawad Khaliq

KubeCon NA 2023 in Chicago was another successful CNCF event, drawing thousands of attendees from around the world. It was great to catch up with community members, including operators, customers, and contributors, and to meet new faces in the hallways. The city of Chicago, as always, was a beautiful and welcoming host, treating us to delicious food. Kudos to CNCF for another successful KubeCon.


More importantly, this conference was very different from the KubeCons I’ve attended before. Here is why:

  1. Kubernetes has hit its “Linux moment”! The spotlight was less on new tools and more on how Kubernetes is being used to run mission-critical applications.
  2. As open-source software becomes more prevalent, so does the complexity of its maintenance.
  3. DevOps evolves into Platform Engineering – a balanced blend of people, culture, processes, and technology. 
  4. Teams struggle to discern real threats from a vast sea of false positives in security scanning.
  5. Hiring, training, and retaining talent is becoming increasingly challenging.
  6. Observability and incident management tools are essential but not sufficient.
  7. Kubernetes meets AI, but GPU scarcity remains a challenge.
  8. Mundane, repetitive operational tasks have become a norm - automation remains elusive.

Let’s dive into each.

1. Kubernetes has hit its “Linux moment”! The spotlight was less on new tools and more on how Kubernetes is being used to run mission-critical applications

The KubeCon spotlight shifted from new tools to Kubernetes' maturity and its pivotal role in mission-critical applications. Most of the 50+ attendees I spoke to and hundreds more that our team interacted with rely on Kubernetes for mission-critical workloads. The focus has evolved to prioritize security, availability, and operational excellence. While cloud providers and managed Kubernetes services contribute to this shift, challenges persist, particularly in managing the layers built upon the Kubernetes infrastructure.

2. As open-source software becomes more prevalent, so does the complexity of its maintenance

The increasing dependence on open-source software was evident at KubeCon, encompassing key technologies like Linux, Kubernetes, Envoy, Cert Manager, Prometheus, and Karpenter, among hundreds of others. However, this reliance presents challenges as managing open-source software is complex, requiring in-depth knowledge of each layer and its dependencies. Despite being free in terms of licensing, open-source software entails a substantial total cost of ownership, including maintenance tasks like security patching and staying up to date with compatible and supported versions.

3. DevOps evolves into Platform Engineering – a balanced blend of people, culture, processes, and technology 

DevOps roles are evolving into Platform Engineering, with teams either merging or forming new dedicated units. This change takes a comprehensive approach, encompassing technology, people, processes, and culture. The primary objective of Platform Engineering is to empower internal application developers by enhancing their experience and productivity through self-service capabilities. Many enterprises have platform groups or tribes, each comprising multiple teams responsible for tasks such as cluster management, improving the developer experience, and managing specific stack layers. It's important to note that Platform Engineering doesn't have a one-size-fits-all definition or implementation; it adapts to each organization's unique requirements.

4. Teams struggle to discern real threats from a vast sea of false positives in security scanning

Most teams face challenges in discerning real threats from false positives in CVE scanning. Application teams constantly patch containers and libraries due to security team pressure, making it challenging to pinpoint real risks. This abundance of false positives leads to alert fatigue, with few addressing the underlying problem. Chainguard's approach to tackling the root of the problem (using minimal, distroless images) is brilliant and is a step in the right direction.
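To make the triage problem concrete, here is a minimal Python sketch of separating scanner findings that are likely actionable from those that can be deferred. Everything in it is an assumption for illustration: the `Finding` fields, the `triage` function, the severity scale, and the placeholder CVE identifiers are hypothetical and do not correspond to any particular scanner's output format.

```python
from dataclasses import dataclass

# Hypothetical severity scale, lowest to highest.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


@dataclass(frozen=True)
class Finding:
    """One scanner finding (hypothetical shape, not a real scanner schema)."""
    cve_id: str
    package: str
    severity: str
    fix_available: bool


def triage(findings, packages_in_use, min_severity="high"):
    """Split findings into (actionable, deferred).

    A finding is treated as actionable only if the affected package is
    actually used by the workload, the severity meets the threshold,
    and a fixed version exists.
    """
    threshold = SEVERITY_ORDER.index(min_severity)
    actionable, deferred = [], []
    for finding in findings:
        relevant = finding.package in packages_in_use
        severe = SEVERITY_ORDER.index(finding.severity) >= threshold
        if relevant and severe and finding.fix_available:
            actionable.append(finding)
        else:
            deferred.append(finding)
    return actionable, deferred


if __name__ == "__main__":
    # Placeholder identifiers, not real CVEs.
    findings = [
        Finding("CVE-0000-0001", "openssl", "critical", True),
        Finding("CVE-0000-0002", "bash", "high", True),
        Finding("CVE-0000-0003", "openssl", "low", True),
    ]
    actionable, deferred = triage(findings, packages_in_use={"openssl"})
    for finding in actionable:
        print("fix now:", finding.cve_id, finding.package)
    print("deferred:", len(deferred))
```

A minimal image shrinks the set of installed packages in the first place, so fewer findings exist to triage at all; a filter like this only manages the noise that remains.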

5. Hiring, training, and retaining talent is becoming increasingly challenging

Teams are struggling to hire, train, and retain top talent. Due to the nature of the tasks, headcount tends to grow in proportion to infrastructure scale, which is unsustainable. This approach runs counter to the vision of Platform Engineering teams, which ideally rely on automation rather than linear headcount increases. Yet the urgency of maintenance and firefighting tasks often overshadows the implementation of large-scale automation projects, leaving platform teams in constant catch-up mode.

The dilemma is further exacerbated by the scarcity of specialized talent. Key roles like Cloud Engineer and DevOps Engineer are among the most in-demand in the tech sector, making hiring a challenging and often unviable strategy, particularly for organizations outside the big tech industry. On-the-job training for newer team members, while necessary, consumes the valuable time of experienced engineers, detracting from their core responsibilities of innovation and strategic development. Consequently, the most skilled professionals are constantly entangled in operational tasks, leading to burnout and dissatisfaction. This cycle often forces teams to turn to contractors as a stopgap measure, which brings its own set of challenges and doesn't fundamentally solve the issue of building a sustainable, skilled internal team.

6. Observability and incident management tools are essential but not sufficient

The prevailing reactive approach in observability and incident management is fundamentally limited. Teams rely on tools like Datadog, Prometheus, Grafana, and others for post-failure analysis, but this falls short in preventing known errors and disruptions. While observability systems are efficient in mitigating immediate failures, they lack proactive capabilities. Organizations struggle to apply past incident learnings across the board, leading to repeated failures and inefficiencies. The reactive approach, while essential for troubleshooting, lacks a systematic method for proactive learning and adaptation, highlighting a significant gap in current operational practices.

7. Kubernetes meets AI, but GPU scarcity remains a challenge

Startups offering LLM services and AI-optimized data centers are turning to Kubernetes or have already chosen it. However, GPUs remain a scarce resource. Data centers and cloud providers that possess GPUs are becoming prime migration targets for companies dependent on GPU-powered operations. Kubernetes is being used as a key infrastructure component in these environments, aligning with the needs of AI and LLM applications.

8. Mundane, repetitive operational tasks have become a norm - automation remains elusive

The flexibility of the CNCF landscape leads to complex operational demands. The CNCF ecosystem boasts over 1,000 components. This flexibility, while beneficial, introduces overwhelming complexity in managing Kubernetes environments. A typical Kubernetes cluster includes dozens of open-source or vendor-managed add-ons, each with its own dependencies and release cycles that platform teams must manage. For enterprises running clusters across various cloud providers and Kubernetes distributions, the complexity multiplies, making even simple information retrieval tasks exceedingly difficult.

This complexity results in platform teams being bogged down by mundane, repetitive tasks. Teams across the globe find themselves recreating spreadsheets and custom scripts to answer basic questions like add-on versions, compatibility information, upcoming upgrade schedules, and potential upgrade impacts on applications. This focus on routine operational tasks detracts from addressing more critical concerns like automation and developer experience for internal application teams. These mundane tasks, while necessary, divert valuable resources from more strategic and impactful work, underscoring a need for more efficient approaches to manage the operational complexity of Kubernetes environments.
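As a rough illustration of the kind of custom script teams end up writing, the Python sketch below checks a hand-maintained add-on inventory against a compatibility table before a cluster upgrade. The add-on names appear earlier in this post, but the version numbers, the compatibility ranges in `COMPAT`, and the helper names are hypothetical placeholders, not real support data from any project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddOn:
    """An add-on installed in a cluster (hypothetical inventory record)."""
    name: str
    version: str


# Hypothetical compatibility table, maintained by hand:
# add-on name -> {add-on "major.minor": (min k8s minor, max k8s minor)}
COMPAT = {
    "cert-manager": {"1.12": (22, 27), "1.13": (23, 28)},
    "karpenter": {"0.31": (23, 28)},
}


def major_minor(version: str) -> str:
    """Return 'major.minor' from a version string such as 'v1.12.3'."""
    major, minor, *_ = version.lstrip("v").split(".")
    return f"{major}.{minor}"


def k8s_minor(version: str) -> int:
    """Return the minor number of a Kubernetes version such as '1.28'."""
    return int(version.lstrip("v").split(".")[1])


def check_upgrade(addons, target_k8s):
    """Return (name, status) pairs for a planned Kubernetes upgrade.

    Status is 'compatible', 'incompatible', or 'unknown' when the
    table has no entry for that add-on version.
    """
    target = k8s_minor(target_k8s)
    report = []
    for addon in addons:
        supported = COMPAT.get(addon.name, {}).get(major_minor(addon.version))
        if supported is None:
            status = "unknown"
        elif supported[0] <= target <= supported[1]:
            status = "compatible"
        else:
            status = "incompatible"
        report.append((addon.name, status))
    return report


if __name__ == "__main__":
    inventory = [
        AddOn("cert-manager", "v1.12.3"),
        AddOn("karpenter", "0.31.0"),
        AddOn("prometheus", "2.47.0"),
    ]
    for name, status in check_upgrade(inventory, "1.28"):
        print(f"{name}: {status}")
```

Even this toy version shows where the toil comes from: the table has to be kept current by hand for every add-on, every release, and every cluster, which is exactly the work that crowds out automation.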

See You Next Time

In conclusion, as Kubernetes enters its second decade with a growing community of developers and operators, I'm very excited about its future. It was great connecting with former colleagues, friends and customers, and I look forward to seeing you all in Paris. Here are some photos from the event:

Chicago Deep-Dish Pizza
A Chicago visit is incomplete without deep-dish pizza
Vintage eBPF SWAG with ex-teammates: Brenden Blanco, Eduard Serra and Gaetano Borgione
Gus Lees rocking the orange!
With ex-EKS teammate Micah Hausler and the EKS team
Hosting Bogomil Balkansky at Chkk booth
eBPF documentary
Ali with Giri and the team
Gotta love that Akuity shield!
Closing the week with human-foosball!

And if you're looking for official KubeCon photos, you can find them on this Flickr link: https://www.flickr.com/photos/143247548@N03/albums/72177720312486917

Tags
Kubernetes
KubeCon