incident.io is redefining what an incident can be
A conversation with Chris Evans, CTO at incident.io.
Chris Evans from incident.io has a phrase that captures their philosophy: "Move fast when you break things." It's a riff on the old Facebook motto, but with an important twist. Breaking things is inevitable—the question is how quickly you can respond.
incident.io is an incident management platform that WorkOS uses internally; Chris and the WorkOS team have been friends for years. The company helps engineering teams find, fix, and prevent issues, from paging engineers at 2 AM to communicating with customers on status pages.
Software breaks all the time
Here's something most people don't intuit: software goes wrong at every company, constantly. Even ChatGPT, one of the most impressive consumer products in recent memory, is "up and down like a yo-yo." OpenAI is an incident.io customer; its status page runs on the platform.
This isn't a criticism of OpenAI. It's the reality of operating software at scale. Having incidents is actually a privilege, a marker of success. If your incidents matter enough that you need tooling to respond to them effectively, you've built something people depend on.
The Slack-first workflow
What makes incident.io work is how easy it is to create incidents. Slash command in Slack, incident created, channel spun up, on-call engineer paged. The friction is so low that WorkOS started creating incidents for things that weren't even outages—billing issues with key customers, anything that required urgent reactive work.
This is emergent behavior that incident.io deliberately enabled. They define incidents as "urgent reactive work"—something unplanned that you have to jump on immediately. If the cost of creating an incident is incredibly low and the benefit is coordination and data, why not use it for anything that fits that pattern?
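To make the "low cost, high benefit" point concrete, here is a minimal sketch of a low-friction declaration flow built on Slack's Bolt for Python SDK. It is not incident.io's implementation: the command name, channel naming, and the paging hook are all assumptions.

```python
# Minimal sketch of a low-friction "declare an incident" flow, assuming Slack Bolt for Python.
# Not incident.io's implementation; the paging step is a hypothetical placeholder.
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

@app.command("/incident")
def declare_incident(ack, command, client):
    # Acknowledge immediately so Slack doesn't show a timeout to the caller.
    ack("Declaring incident...")
    title = command.get("text") or "untitled"
    # Spin up a dedicated channel for the incident and pull the declarer in.
    slug = "inc-" + title[:60].lower().replace(" ", "-")
    channel_id = client.conversations_create(name=slug)["channel"]["id"]
    client.conversations_invite(channel=channel_id, users=command["user_id"])
    # page_on_call(...) would go here: a hypothetical hook into your paging tool.
    client.chat_postMessage(
        channel=channel_id,
        text=f"Incident declared by <@{command['user_id']}>: {title}",
    )

if __name__ == "__main__":
    app.start(port=3000)
```

The point of the sketch is the shape: one command, and the coordination scaffolding (a channel, a page, context) appears without anyone having to remember a process.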
From quarterly crisis to continuous response
The old model of incident management was the quarterly catastrophe: a major cloud provider goes down, you scramble to spin up a Jira ticket and a Google Doc, and nobody knows the process because it happens so infrequently. It's scary and clunky.
incident.io flipped this. They have dozens of incidents per day internally. Every Sentry error that affects a customer creates an incident automatically, pages the relevant engineer, and provides a direct link to the customer's Slack channel. Engineers proactively reach out to customers before the customers even notice something went wrong.
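As a rough sketch of what that kind of automation can look like in the abstract: a webhook receiver turns a customer-impacting error into an incident, a page, and a pointer to the customer's channel. The route, payload fields, and helper functions below are placeholders, not Sentry's or incident.io's actual APIs.

```python
# Hedged sketch of an error-to-incident automation, not incident.io's actual pipeline.
# The webhook path, payload fields, and helpers stand in for whatever your stack exposes.
from flask import Flask, request, jsonify

app = Flask(__name__)

# Placeholder integrations: swap these for real API calls in your own stack.
def lookup_affected_customer(event: dict) -> dict | None:
    """Map an error event to an affected customer, if any (hypothetical)."""
    return event.get("customer")

def create_incident(title: str, severity: str) -> dict:
    """Open an incident in your incident tool (hypothetical)."""
    print(f"[incident] {severity}: {title}")
    return {"title": title, "severity": severity}

def page_owning_engineer(event: dict) -> None:
    """Page whoever owns the failing code path (hypothetical)."""
    print(f"[page] {event.get('owner', 'on-call')}")

@app.post("/hooks/error-alert")
def error_alert():
    event = request.get_json(force=True)
    customer = lookup_affected_customer(event)
    if customer is None:
        # Not customer-impacting: no incident needed.
        return jsonify(ok=True)

    create_incident(
        title=f"Error affecting {customer['name']}",
        severity="minor",
    )
    page_owning_engineer(event)
    # Surface the customer's Slack channel so the engineer can reach out proactively.
    print(f"[incident] customer channel: {customer.get('slack_channel_url', 'n/a')}")
    return jsonify(ok=True)

if __name__ == "__main__":
    app.run(port=5000)
```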
It's analogous to the shift from seasonal deployments to continuous deployment. When deploying code was a big scary event, nobody got practice. Now it happens constantly and nobody thinks twice.
AI that took 18 months to get right
incident.io has AI summaries—wake up, catch up on the incident without back-scrolling through 100 messages. But their more ambitious AI investment is in root cause analysis. They've spent 18 months building a multi-agent system that investigates incidents: checking Grafana, checking logs, checking deploys, forming hypotheses about probable causes.
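The architecture, as described, resembles a coordinator fanning out to specialist checkers and merging their evidence into hypotheses. Here is a toy Python sketch of that shape; the data sources, findings, and scoring are invented for illustration and say nothing about how incident.io's system actually works.

```python
# Toy sketch of a fan-out investigation: independent checkers gather evidence,
# a coordinator ranks it into hypotheses. All data and weights are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Finding:
    source: str      # e.g. "grafana", "logs", "deploys"
    summary: str     # human-readable evidence
    weight: float    # how suggestive this evidence is, 0..1

def check_dashboards(incident: dict) -> list[Finding]:
    # Placeholder: query dashboards for anomalous panels around the incident start time.
    return [Finding("grafana", "p99 latency on /checkout doubled at 14:02", 0.7)]

def check_logs(incident: dict) -> list[Finding]:
    # Placeholder: search logs for new error signatures in the affected service.
    return [Finding("logs", "spike of 'connection refused' from payments-db", 0.8)]

def check_deploys(incident: dict) -> list[Finding]:
    # Placeholder: list deploys that landed shortly before the incident.
    return [Finding("deploys", "payments v231 shipped 4 minutes before alerts fired", 0.9)]

def investigate(incident: dict, checkers: list[Callable[[dict], list[Finding]]]) -> list[str]:
    findings = [f for checker in checkers for f in checker(incident)]
    # Rank evidence by weight. In a real system an LLM would synthesize these into a
    # narrative hypothesis, and each finding would link back to its raw data so a
    # human can check what the agent actually saw.
    findings.sort(key=lambda f: f.weight, reverse=True)
    return [f"{f.source}: {f.summary} (confidence {f.weight:.0%})" for f in findings]

if __name__ == "__main__":
    for line in investigate({"service": "payments"}, [check_deploys, check_logs, check_dashboards]):
        print(line)
```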
Two weeks in, the team was high-fiving. The prototype looked incredible. Then they used it on their own incidents and it was "utter garbage"—confidently wrong, obvious suggestions, no way to debug what it was seeing.
The lesson: it's easy to produce a convincing demo. It's hard to build a product that actually works. They invested heavily in ground-truthing, labeling the last hundred incidents to measure whether the AI was actually improving. Without that evaluation infrastructure, they couldn't tell signal from noise.
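The ground-truthing idea is worth making concrete: keep a labeled set of past incidents and score every new version of the investigator against it, so "is it getting better?" becomes a number rather than a vibe. A minimal sketch, with an assumed label format and a deliberately crude matcher:

```python
# Minimal sketch of evaluating an AI investigator against human-labeled incidents.
# The label format and the substring matcher are assumptions; a real harness would
# use a stronger judge (human review or an LLM grader).
from dataclasses import dataclass
from typing import Callable

@dataclass
class LabeledIncident:
    incident_id: str
    true_cause: str  # human-written root cause: the ground-truth label

def evaluate(investigator: Callable[[str], str], labeled: list[LabeledIncident]) -> float:
    """Fraction of labeled incidents where the AI's hypothesis matches the human label."""
    hits = 0
    for item in labeled:
        hypothesis = investigator(item.incident_id)
        if item.true_cause.lower() in hypothesis.lower():
            hits += 1
    return hits / len(labeled) if labeled else 0.0

if __name__ == "__main__":
    # Toy investigator and a two-incident labeled set, purely illustrative.
    golden = [
        LabeledIncident("INC-101", "bad deploy of payments service"),
        LabeledIncident("INC-102", "expired TLS certificate"),
    ]
    fake_ai = lambda incident_id: "Likely cause: bad deploy of payments service"
    print(f"accuracy: {evaluate(fake_ai, golden):.0%}")
```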
This interview was conducted at AWS re:Invent 2025.