We have worked together for a decade to build and operationalize cloud infrastructure services for customers including startups, scale-ups, enterprises, and service providers. Chkk’s inspiration came from our time at AWS, where, as an EKS engineer, Fawad Khaliq saw firsthand how errors, disruptions, and failures in Kubernetes environments were repeated across customers, especially in the customer-owned platform layers, with no way for them to learn from each other. At the same time, Ali Khayam and Awais Nemat were operating network infrastructure services that had an “always available, no downtime” expectation. Ali and Awais experienced firsthand how vital a proactive approach is to availability for mission-critical services running one of the biggest infrastructure footprints on the planet.
Another aspect became obvious to us while working with thousands of AWS customers: incidents and outages happened repeatedly due to the same root causes, across different enterprises, at different points in time. Most of these enterprises responded reactively to the same set of issues that someone else had already dealt with. There was no easy way to find out, a priori, that known Availability Risks were lurking in the infrastructure and could cause errors, disruptions, and failures.
This inspired us to start Chkk and democratize the collective wisdom of operating infrastructure at scale. The internet is a goldmine of information (open-source projects, blogs, Slack threads, etc.) about failures that platform, DevOps, and SRE teams have experienced and how to prevent them. But there is too much information out there for any one team or organization to consume, and there is no solution for programmatically sharing the resulting learnings with other teams. This is Chkk’s mission: to enable developers to proactively prevent incidents by learning from others and not repeating known mistakes.