Self-driving production: Autonomous agents for incident response
Traversal CEO Anish Agarwal explains how autonomous agents troubleshoot production incidents at scale: from world models to L5 autonomy. Interview from HumanX 2026.
Nobody wants to be paged at 2 a.m. But for most engineering organizations, that's still the default. A production system breaks, people pile into a war room, and someone starts sifting through massive volumes of logs trying to figure out what went wrong. Hours burned. Engineers fried. And the root cause is almost always something that, in retrospect, looks obvious.
At HumanX 2026, Michael Grinich sat down with Anish Agarwal, co-founder and CEO of Traversal, to talk about what happens when you point autonomous agents at this problem — and why the self-driving car analogy works as a framework for production autonomy.
The production world model
At the core of Traversal is something they call a production world model. Not a dashboard. Not a monitoring tool. A structured representation of your entire production system — every component, every dependency, every relationship between them.
The same way world models power self-driving cars and robotics, this model powers autonomous incident troubleshooting. When something breaks, the agent already has the full map. It knows what's connected to what. It reasons about cause and effect, not just correlation.
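Traversal hasn't published the internals of its world model, but the core idea — a dependency graph that lets an agent walk outward from a failing component to everything that could plausibly have caused the failure — can be sketched in a few lines. All service names below are invented for illustration:

```python
from collections import defaultdict

class ProductionWorldModel:
    """Toy dependency graph: edges point from a service to what it depends on."""

    def __init__(self):
        self.deps = defaultdict(set)

    def add_dependency(self, service, depends_on):
        self.deps[service].add(depends_on)

    def root_cause_candidates(self, failing_service):
        """Everything the failing service transitively depends on is a
        candidate cause; walk the graph to collect them."""
        seen, stack = set(), [failing_service]
        while stack:
            svc = stack.pop()
            for dep in self.deps[svc]:
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

# Hypothetical topology
m = ProductionWorldModel()
m.add_dependency("checkout", "payments")
m.add_dependency("checkout", "inventory")
m.add_dependency("payments", "postgres")
print(sorted(m.root_cause_candidates("checkout")))
# ['inventory', 'payments', 'postgres']
```

A real world model would carry far more than edges — deploy history, config versions, ownership — but even this skeleton shows why having the map up front beats rediscovering it during an incident.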
The hard engineering problem: large enterprises can produce on the order of a petabyte of telemetry data per day. You can't stuff that into an LLM context window. The cost alone would be prohibitive. Traversal built what Anish calls an AI-native compressor — targeting 1,000:1 compression, petabyte to terabyte, without losing actionable signal. Observability data is massively redundant. The same log patterns fire over and over. Strip the redundancy, preserve the signal, hand it to the model.
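Traversal hasn't described how its compressor works, but the redundancy-stripping intuition can be illustrated with a toy template miner: variable tokens (IPs, hex IDs, numbers) are masked so that repeated log patterns collapse into one template plus a count. Production-grade systems use more sophisticated algorithms (log template mining such as Drain is one known approach); the log lines below are made up:

```python
import re
from collections import Counter

def template(line):
    """Mask variable tokens so repeated log patterns collapse to one template."""
    line = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<IP>", line)   # IPv4 addresses
    line = re.sub(r"0x[0-9a-f]+", "<HEX>", line)             # hex identifiers
    line = re.sub(r"\d+", "<NUM>", line)                     # remaining numbers
    return line

def compress(log_lines):
    """Collapse a stream of log lines into (template -> count) pairs."""
    return Counter(template(line) for line in log_lines)

# Hypothetical log stream: three near-identical lines and one anomaly
logs = [
    "GET /api/cart 200 in 12ms from 10.0.0.1",
    "GET /api/cart 200 in 9ms from 10.0.0.2",
    "GET /api/cart 200 in 11ms from 10.0.0.3",
    "OOMKilled pod checkout-7f9c at 0x4af2",
]
for tmpl, n in compress(logs).most_common():
    print(n, tmpl)
```

Here four lines become two templates, and the rare template — the anomaly — is exactly what survives compression. That is the "strip the redundancy, preserve the signal" trade at miniature scale.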
That compressed, structured world model is what makes autonomous troubleshooting possible at enterprise scale.
L0 to L5: The same autonomy ladder as cars
Traversal maps production autonomy directly onto the SAE self-driving framework: L0 is fully manual troubleshooting, and each level hands more of the loop to the agent, up to L5, where no human is in the loop at all.
Note: this L0–L5 mapping is Traversal's own analogy, not an industry standard. The SAE levels for vehicles define specific handoff semantics between human and machine that don't translate directly to incident response, but the progression from manual to fully autonomous is conceptually parallel.
L4 to L5 is a people problem, not a tech problem
This was the most surprising part of the conversation: according to Anish, getting from L4 to L5 is about 80% change management, 20% technology. The technical gap isn't that large. The organizational gap is.
Traversal's path with customers is typically a gradual climb up that autonomy ladder, with the agent earning trust one level at a time.
According to Traversal, three months in, most customers are ready to let it run.
The part that pushes back on the usual narrative: SREs actually want this. The concern about job displacement hasn't played out at Traversal's customers so far. Engineers want to design resilient systems for the next five years, not firefight every 30 seconds. Every time Traversal self-heals an incident, engineers lean in further.
The Zoox comparison lands here. Yes, it's weird getting into a car with no steering wheel. Then it takes you where you're going, and the weirdness fades fast.
The frontier: Causal reasoning
The next hard problem isn't autonomy; it's causality.
LLMs can surface correlations from large datasets effectively. They are still unreliable at determining cause and effect from time series data — a task that requires interventional or counterfactual reasoning, not just pattern matching over observed distributions. Anish knows this problem well — his research at MIT and Columbia has focused on causal machine learning.
The models are improving, but causal reasoning is a fundamentally different capability than statistical pattern matching. Getting it right in production environments — where the haystack is logs, metrics, config files, and dashboards — remains an open problem.
When it is solved, the applications extend well beyond SRE. The same needle-in-a-haystack problem shows up in security, networking, product analytics, and business operations. Traversal is focused on production today, but the surface area is large.
The missing piece of the AI development story
Every CIO right now is trying to automate the full software development lifecycle. Code generation, code review, testing, deployment. It's getting there.
Production is the gap that hasn't received the same level of AI investment. You can have AI-generated code commits, autonomous CI/CD, and AI-assisted code review — and still have humans manually sifting through logs at 3 a.m. when something breaks in prod. That's the bottleneck that comes next.
The productivity gains from AI-generated code don't fully materialize until production is resilient enough to handle the increased deployment velocity. If you're shipping code 10x faster but your incident response is still manual, you've moved the bottleneck downstream.
Traversal is building exactly that missing layer. Nobody wants to be paged at 2 a.m. Soon, nobody will have to be.
This interview was recorded at HumanX 2026 in San Francisco.