Company
October 25, 2023

Chkk Launches the Kubernetes Availability Platform

Written by
Awais Nemat
Estimated Reading time
5 min

Today, my team and I are excited to publicly launch the Chkk Kubernetes Availability Platform. We want to thank our early customers and design partners, who have worked closely with us since February, when we opened our waitlist to anyone interested in proactively addressing infrastructure errors, disruptions, and failures.

I'm humbled that our customers love Chkk, which is already used by Alef Edge, Fairtiq, Nexoya, Phi Labs, Yoti, and many others. Thank you for helping us validate our thesis and develop our product.

Our core thesis

While working at AWS, we observed a recurring pattern: different enterprises, at different points in time, experienced the same errors, failures, and disruptions due to the same root causes. Every company responded reactively to the same set of issues that other companies had already dealt with. There was no easy way for any of them to find out in advance about known Availability Risks lurking in their infrastructure that could trigger incidents leading to downtime. We realized that there was an opportunity to help our future customers.

Our thesis was: 

  1. Customers care about availability and want to prevent errors proactively rather than react after the impact, which wastes time and effort and puts reputation and credibility at risk.
  2. If an Availability Risk has already materialized into a disruption somewhere in the world, it is highly likely to materialize again and again in many enterprises, causing operational pain and loss.
  3. Customers want to learn from, and not repeat, mistakes that have already harmed others, but there is no simple, automated, and trusted way for them to learn from each other and avoid known risks.

That's where Chkk comes in.

We took inspiration from cybersecurity, where security vulnerabilities are reported publicly, and came up with this simple idea: If there's any error, failure, or disruption that has happened anywhere in the world, we will learn about it. We’ll convert it into an Availability Risk Signature, similar to a virus signature, and then stream it to all our customers, whose environments are scanned against it. That way, our customers can proactively detect, identify, and remediate Availability Risks before they cause disruptions, much like antivirus software detects and removes viruses before they cause harm.


With Chkk, our customers learn about Availability Risks from an authoritative source and proactively prevent these incidents from happening altogether.

How the Chkk Kubernetes Availability Platform works

Our first product is a SaaS offering designed for organizations that run mission-critical applications on Kubernetes infrastructure. We help them reduce Availability Risks, prevent errors and disruptions, and operate Kubernetes safely and efficiently. Not only do we identify and prioritize risks, we also provide Preverified Upgrade Plans, so our customers can cut weeks of preparation down to days and safely remediate these risks without worrying about the complexities and intricate interdependencies involved in fixing them.

There are three distinct modules in the Chkk Kubernetes Availability Platform.

Risk Ledger is similar to security risk ledgers, but tailored specifically towards identifying contextualized Availability Risks within Kubernetes infrastructures. It enables our customers to become proactive in addressing potential failures before they happen.

Artifact Register maintains an inventory of all components, container images, repositories, and tools across multiple clusters and clouds. It gives our customers visibility into what exists where, reducing the need for the manual, error-prone tracking in spreadsheets and scripts that they rely on today.

Upgrade Copilot is especially valuable for Platform, DevOps, and SRE Engineers responsible for planning and executing infrastructure upgrades. We provide Preverified Upgrade Plans containing a detailed sequence of steps that need to be executed for remediation. Optionally, we then execute the prescribed sequence on a digital twin of the customer's infrastructure to validate that the plan works as expected. This significantly reduces the time and effort required to plan these upgrades and derisks the execution of this critical task for our customers.

All modules of Chkk seamlessly integrate with existing workflows and tools (IaC, packaging, deployment, monitoring, ticketing, and alerting) and simplify existing operational processes.

Powered by Collective Learning

Many of our customers ask us: how do you learn about all the issues and Availability Risks? How do you make sure that your remediations and Upgrade Plans are safe to execute? How do you manage these intractable problems? What’s the magic?

The magic is our Collective Learning Technology.

At the heart of Collective Learning is the Availability Risk Signature Database, or ARSig DB. Think of it as a CVE database for Availability Risks, along with a Knowledge Graph that captures all the relationships across different artifacts – issues, release notes, and any and all breaking changes.

On the backend, our technology continuously sources and populates this ARSig DB and Knowledge Graph from multiple sources. First and foremost, we mine the internet for publicly available information – incidents, reports, tickets, issues, and discussions on internet forums. We scour everything where we can find a signal. Our research team then validates these candidates and converts them into programmatic signatures that can later be scanned and contextualized against a customer’s infrastructure.
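To make the idea of a programmatic signature a little more concrete, here is a minimal sketch in Python of how a signature could be matched against an inventory of what is running in each cluster. It is an illustration under stated assumptions, not a description of how Chkk implements it: the `RiskSignature` and `InstalledComponent` record shapes, the `scan` function, the `example-ingress` component, and the `ARSIG-EXAMPLE-1` identifier are all hypothetical and invented for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class RiskSignature:
    """Hypothetical signature record; the field names are illustrative assumptions."""
    sig_id: str
    component: str
    affected_versions: FrozenSet[str]
    affected_k8s_minors: FrozenSet[str]
    summary: str


@dataclass(frozen=True)
class InstalledComponent:
    """One inventory entry: a component version running in a given cluster."""
    cluster: str
    k8s_minor: str
    component: str
    version: str


def scan(
    signatures: List[RiskSignature],
    inventory: List[InstalledComponent],
) -> List[Tuple[RiskSignature, InstalledComponent]]:
    """Return every (signature, component) pair where the signature applies."""
    findings = []
    for item in inventory:
        for sig in signatures:
            if (
                sig.component == item.component
                and item.version in sig.affected_versions
                and item.k8s_minor in sig.affected_k8s_minors
            ):
                findings.append((sig, item))
    return findings


# Made-up data, for illustration only.
signatures = [
    RiskSignature(
        sig_id="ARSIG-EXAMPLE-1",
        component="example-ingress",
        affected_versions=frozenset({"1.2.0", "1.2.1"}),
        affected_k8s_minors=frozenset({"1.27"}),
        summary="example: controller crash-loops after a control plane upgrade",
    )
]
inventory = [
    InstalledComponent("prod-eu", "1.27", "example-ingress", "1.2.1"),
    InstalledComponent("prod-us", "1.26", "example-ingress", "1.2.1"),
    InstalledComponent("staging", "1.27", "example-ingress", "1.3.0"),
]

findings = scan(signatures, inventory)
for sig, item in findings:
    print(f"{item.cluster}: {sig.sig_id} ({sig.summary})")
```

In this toy data, only `prod-eu` matches: `prod-us` runs a Kubernetes version outside the signature's range, and `staging` runs a component version that is not affected. That is the sense in which a signature is contextualized: the same risk is flagged only where the environment actually meets its conditions.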

We also ingest release notes, breaking changes, and bug report feeds from Kubernetes add-on vendors and open-source projects into our ARSig DB and Knowledge Graph. And of course, we also learn from our users. We continuously add these learnings to our ARSig DB and Knowledge Graph, which become more valuable for our customers over time.

All Chkk modules use the ARSig DB and Knowledge Graph to identify and prioritize risks, and to locate them with pinpoint accuracy within a Kubernetes fleet. We also use them to create and preverify the Upgrade Plans that our customers use to remediate these issues.

We’ve taken time and care to build and fine-tune the technology to prioritize and address the right risks. Our customers appreciate that we offer concise, actionable plans to resolve the most critical risks, rather than burdening them with an exhaustive list of unnecessary ones.

A bright future ahead

In order to build a future powered by Collective Learning, Chkk has raised $5.2 million in seed funding from angels and VCs led by Sequoia Capital. We are grateful that Sequoia believes in our mission and is joining us in democratizing the wisdom of operating software at scale.

We have built the Chkk Kubernetes Availability Platform for our customers running mission-critical apps on Kubernetes infrastructure. It helps Platform, DevOps, and SRE teams proactively manage and remediate risks, execute safe upgrades, eliminate wasted effort, and accomplish more with fewer resources. 

The Chkk Kubernetes Availability Platform is available today – it installs in minutes and integrates into your existing tools and workflows. Please sign up to get started.
