KubeCon NA 2023 in Chicago was another successful CNCF event, drawing thousands of attendees from around the world. It was great to catch up with community members, including operators, customers, and contributors, and to meet new faces in the hallways. Chicago, as always, was a beautiful and welcoming host, treating us to delicious food. Kudos to CNCF for pulling off another great event.
More importantly, this conference was very different from the KubeCons I've attended before. Here is why:
Let’s dive into each.
The KubeCon spotlight shifted from new tools to Kubernetes' maturity and its pivotal role in mission-critical applications. Most of the 50+ attendees I spoke to and hundreds more that our team interacted with rely on Kubernetes for mission-critical workloads. The focus has evolved to prioritize security, availability, and operational excellence. While cloud providers and managed Kubernetes services contribute to this shift, challenges persist, particularly in managing the layers built upon the Kubernetes infrastructure.
The increasing dependence on open-source software was evident at KubeCon, encompassing key technologies like Linux, Kubernetes, Envoy, Cert Manager, Prometheus, and Karpenter, among hundreds of others. However, this reliance presents challenges as managing open-source software is complex, requiring in-depth knowledge of each layer and its dependencies. Despite being free in terms of licensing, open-source software entails a substantial total cost of ownership, including maintenance tasks like security patching and staying up to date with compatible and supported versions.
DevOps roles are evolving into Platform Engineering, with teams either merging or forming new dedicated units. This change takes a comprehensive approach, encompassing technology, people, processes, and culture. The primary objective of Platform Engineering is to empower internal application developers by enhancing their experience and productivity through self-service capabilities. Many enterprises have platform groups or tribes, each comprising multiple teams responsible for tasks such as cluster management, improving the developer experience, and managing specific stack layers. It's important to note that Platform Engineering doesn't have a one-size-fits-all definition or implementation; it adapts to each organization's unique requirements.
Most teams face challenges in discerning real threats from false positives in CVE scanning. Application teams constantly patch containers and libraries due to security team pressure, making it challenging to pinpoint real risks. This abundance of false positives leads to alert fatigue, with few addressing the underlying problem. Chainguard's approach to tackling the root of the problem (using minimal, distroless images) is brilliant and is a step in the right direction.
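To make the distroless idea concrete, here is a sketch of a multi-stage container build (the application name and build commands are hypothetical): the full toolchain stays in the build stage, and the runtime stage ships only a static binary on a distroless base, leaving scanners far fewer OS packages to flag.

```dockerfile
# Build stage: full Go toolchain, never shipped to production
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# Runtime stage: distroless base with no shell or package manager,
# so most OS-level CVE findings simply have nothing to attach to
FROM gcr.io/distroless/static-debian12
COPY --from=build /app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]
```

The trade-off is debuggability: with no shell in the image, troubleshooting shifts to tools like ephemeral debug containers, but the scan results become dramatically quieter.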
Teams are struggling to hire, train and retain top talent. Due to the nature of the tasks, there is unsustainable growth of headcount in proportion to infrastructure scaling. This approach runs counter to the vision of Platform Engineering teams, which ideally rely on automation rather than linear headcount increase. Yet, often, the urgency of maintenance and firefighting tasks overshadows the implementation of large-scale automation projects, leaving platform teams in a constant catch-up mode.
The dilemma is further exacerbated by the scarcity of specialized talent. Key roles like Cloud Engineer and DevOps Engineer are among the most in-demand in the tech sector, making hiring a challenging and often unviable strategy, particularly for organizations outside the big tech industry. On-the-job training for newer team members, while necessary, consumes the valuable time of experienced engineers, detracting from their core responsibilities of innovation and strategic development. Consequently, the most skilled professionals are constantly entangled in operational tasks, leading to burnout and dissatisfaction. This cycle often forces teams to turn to contractors as a stopgap measure, which brings its own set of challenges and doesn't fundamentally solve the issue of building a sustainable, skilled internal team.
The prevailing reactive approach in observability and incident management is fundamentally limited. Teams rely on tools like DataDog, Prometheus, Grafana, and others for post-failure analysis, but this falls short in preventing known errors and disruptions. While observability systems are efficient in mitigating immediate failures, they lack proactive capabilities. Organizations struggle to apply past incident learnings across the board, leading to repetitive failures and inefficiencies. The reactive approach, while essential for troubleshooting, lacks a systematic method for proactive learning and adaptation, highlighting a significant gap in current operational practices.
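The reactive posture is easy to see in a typical alerting rule. This hypothetical Prometheus rule (metric names and thresholds are illustrative) only fires after errors have already been elevated for ten minutes; by the time anyone is paged, the incident is well underway.

```yaml
groups:
  - name: example-reactive-alerts   # hypothetical rule group
    rules:
      - alert: HighErrorRate
        # Fires only after >5% of requests have failed for 10 minutes,
        # i.e. after users have already been affected.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 10 minutes"
```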
Startups offering LLM services and AI-optimized data centers are turning to, or have already chosen, Kubernetes. However, GPUs remain a scarce resource. Data centers and cloud providers that possess GPUs are becoming prime migration targets for companies dependent on GPU-powered operations. Kubernetes is being utilized as a key infrastructure component in these environments, aligning with the needs of AI and LLM applications.
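For readers unfamiliar with how GPU scheduling works in practice: Kubernetes treats GPUs as an extended resource advertised by a device plugin (commonly NVIDIA's). A minimal pod spec looks like the sketch below, where the workload name and image are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference   # hypothetical workload name
spec:
  containers:
    - name: inference
      image: registry.example.com/llm-server:latest  # placeholder image
      resources:
        limits:
          # Requests one whole GPU; the scheduler will only place this pod
          # on a node where the NVIDIA device plugin advertises capacity.
          nvidia.com/gpu: 1
```

Because GPUs cannot be fractionally requested this way, scarcity shows up immediately as pending pods, which is one reason GPU-rich providers have become such attractive migration targets.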
CNCF landscape flexibility leads to complex operational demands. The CNCF ecosystem boasts over 1,000 components. This flexibility, while beneficial, introduces overwhelming complexity in managing Kubernetes environments. A typical Kubernetes cluster includes dozens of open-source or vendor-managed add-ons, each with its own dependencies and release cycles that platform teams must manage. For enterprises running clusters across various cloud providers and Kubernetes distributions, the complexity multiplies, making even simple information retrieval tasks exceedingly complex.
This complexity results in platform teams being bogged down by mundane, repetitive tasks. Teams across the globe find themselves recreating spreadsheets and custom scripts to answer basic questions like add-on versions, compatibility information, upcoming upgrade schedules, and potential upgrade impacts on applications. This focus on routine operational tasks detracts from addressing more critical concerns like automation and developer experience for internal application teams. These mundane tasks, while necessary, divert valuable resources from more strategic and impactful work, underscoring a need for more efficient approaches to manage the operational complexity of Kubernetes environments.
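The paragraphs above can be made concrete with a sketch of exactly the kind of one-off script platform teams keep rewriting: given an inventory of add-on versions per cluster and a minimum supported version per add-on, flag what needs upgrading. All cluster names and version numbers here are hypothetical.

```python
def parse_version(v: str) -> tuple:
    """Turn '1.13.2' into (1, 13, 2) so versions compare numerically."""
    return tuple(int(part) for part in v.split("."))

def find_outdated(clusters: dict, minimums: dict) -> list:
    """Return (cluster, add-on, installed, minimum) rows needing upgrades."""
    rows = []
    for cluster, addons in clusters.items():
        for addon, installed in addons.items():
            minimum = minimums.get(addon)
            if minimum and parse_version(installed) < parse_version(minimum):
                rows.append((cluster, addon, installed, minimum))
    return rows

# Hypothetical inventory: in practice this data comes from spreadsheets,
# Helm release lists, or custom collectors scattered across teams.
clusters = {
    "prod-us-east": {"cert-manager": "1.11.0", "prometheus": "2.47.0"},
    "prod-eu-west": {"cert-manager": "1.13.2", "prometheus": "2.41.0"},
}
minimums = {"cert-manager": "1.12.0", "prometheus": "2.45.0"}

for row in find_outdated(clusters, minimums):
    print("%s: %s %s -> needs >= %s" % row)
```

The script itself is trivial; the point is that every team ends up maintaining its own variant, along with the collectors that feed it, instead of working on automation and developer experience.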
In conclusion, as Kubernetes enters its second decade with a growing community of developers and operators, I'm very excited about its future. It was great connecting with former colleagues, friends, and customers, and I look forward to seeing you all in Paris. Here are some photos from the event:
And if you're looking for official KubeCon photos, you can find them on this Flickr link: https://www.flickr.com/photos/143247548@N03/albums/72177720312486917