Maxim Fateev on why durable execution matters for AI agents

WorkOS CEO Michael Grinich interviews Temporal co-founder Maxim Fateev on durable execution, AI agent reliability, and why workflows need to survive failures.

Conner Simmons

April 15, 2026

AI agents are everywhere right now, and nearly every company seems to be shipping them. But most teams building agents are rediscovering a problem that distributed systems engineers solved years ago: long-running processes fail, and when they do, any progress that was not persisted is lost.

WorkOS CEO Michael Grinich sat down with Maxim Fateev, co-founder and CEO of Temporal, at HumanX 2026 in San Francisco to talk about durable execution — the idea that a workflow should survive crashes, restarts, and infrastructure failures without losing state.

The problem with stateless agents

Most AI agent frameworks treat execution as ephemeral. An agent kicks off a multi-step task — calling an LLM, fetching data from an API, writing to a database — and if anything fails midway, the entire run is gone. You're left retrying from scratch or building ad hoc checkpointing logic that nobody wants to maintain.

Maxim has been thinking about this class of problem for over a decade. Before founding Temporal, he built the workflow orchestration system behind Amazon's Simple Workflow Service (SWF) and later co-created Cadence at Uber. The core insight across all of these systems is the same: if you can durably record each step of a workflow's execution, you can replay it from any point of failure without re-executing completed steps.

This is what Temporal calls durable execution. Your code looks like normal application code — functions calling functions — but the runtime guarantees that every step is persisted. If a worker crashes, another picks up exactly where it left off.
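
The record-and-replay idea can be illustrated with a minimal sketch. This is not Temporal's API or implementation; `Journal`, `run_step`, `tracked`, and `agent_run` are hypothetical names invented for this example, and a local JSON-lines file stands in for a durable store. It only shows the principle: each completed step is persisted, and a fresh worker replays recorded results instead of re-executing them.

```python
import json
import os
import tempfile
from typing import Any, Callable


class Journal:
    """Append-only log of completed step results (one JSON object per line)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.results: dict[str, Any] = {}
        # On startup, load every step that finished before the last crash.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    entry = json.loads(line)
                    self.results[entry["step"]] = entry["result"]

    def run_step(self, name: str, fn: Callable[[], Any]) -> Any:
        # Replay path: a recorded step returns its saved result without re-running.
        if name in self.results:
            return self.results[name]
        result = fn()
        # Persist the result before handing it back to the workflow.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"step": name, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.results[name] = result
        return result


executed: list[str] = []  # names of steps that actually ran (not replayed)


def tracked(name: str, value: Any) -> Callable[[], Any]:
    """Build a fake step that notes it ran and returns a fixed value."""
    def step() -> Any:
        executed.append(name)
        return value
    return step


def agent_run(journal: Journal, crash_before_write: bool) -> str:
    """Hypothetical three-step agent task: plan, fetch, write."""
    plan = journal.run_step("plan", tracked("plan", "look up order"))
    data = journal.run_step("fetch", tracked("fetch", {"order": 42}))
    if crash_before_write:
        raise RuntimeError("simulated worker crash")
    summary = f"{plan}: {data['order']}"
    return journal.run_step("write", tracked("write", summary))


def demo() -> tuple[str, list[str]]:
    path = os.path.join(tempfile.mkdtemp(), "journal.jsonl")
    executed.clear()
    try:
        agent_run(Journal(path), crash_before_write=True)
    except RuntimeError:
        pass  # the first worker died after completing two steps
    # A fresh worker reloads the journal and resumes from the third step.
    result = agent_run(Journal(path), crash_before_write=False)
    return result, list(executed)
```

Calling `demo()` runs the task twice against the same journal. The `plan` and `fetch` steps execute only during the first run, and the second run executes only `write`. Note that a crash between running a step and recording it would cause that step to run again, so this sketch gives at-least-once behavior for each step, not exactly-once.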

Why AI agents make this harder

Traditional workflows often follow a fixed path: step A leads to step B, which leads to step C. AI agents aren't like that. They make decisions at runtime, branch unpredictably, and often run for minutes or hours while waiting on LLM responses or human approvals.

Maxim pointed out that this combination — long-running, dynamically branching, and dependent on unreliable external services — is exactly the worst case for stateless execution. An agent that's been running for 30 minutes, coordinating across multiple tools and APIs, can't afford to lose its state because a container got rescheduled.

Temporal's model fits this well because it separates the workflow logic from the execution infrastructure. The workflow defines what should happen. Temporal's server handles persistence, retries, and scheduling. The agent developer writes straightforward code and gets reliability from the platform rather than hand-rolling it.
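
To make the separation concrete, here is a small sketch in which the retry behavior lives outside the business logic. It is an illustration under stated assumptions, not Temporal's server or SDK: `RetrySettings` and `execute_with_retry` are hypothetical names, and the retry loop runs in-process, whereas a platform like Temporal handles this on the developer's behalf.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetrySettings:
    """Hypothetical retry configuration, declared separately from the logic."""
    max_attempts: int = 3
    initial_interval: float = 0.1   # seconds before the first retry
    backoff_coefficient: float = 2.0  # multiplier applied after each retry


def execute_with_retry(
    fn: Callable[[], T],
    settings: RetrySettings,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff until attempts run out.

    At least one attempt is always made. The last exception is re-raised
    once max_attempts is reached.
    """
    delay = settings.initial_interval
    attempt = 1
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= settings.max_attempts:
                raise
        sleep(delay)
        delay *= settings.backoff_coefficient
        attempt += 1
```

The function passed as `fn` contains no retry code at all. Swapping `RetrySettings(max_attempts=5)` for a stricter or looser policy changes reliability behavior without touching the step itself, which is the division of labor the paragraph above describes.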

Durable execution as infrastructure, not a framework

One of the most interesting points from the conversation was Maxim's framing of durable execution as an infrastructure primitive rather than an application framework. The distinction matters.

Frameworks impose opinions on how you structure your code. Infrastructure primitives provide guarantees you can build on top of. Temporal sits in the infrastructure layer — it doesn't care whether you're building an AI agent, a payment pipeline, or a user onboarding flow. It guarantees your code runs to completion, even across failures.

For teams building agents today and wondering how to make them production-grade, that's a meaningful difference. Instead of bolting reliability onto your agent framework, you run your agent logic as a Temporal workflow and get durability, observability, and retry semantics as a baseline.

IMAGE: A simple architecture diagram showing an AI agent workflow running on Temporal.

Where this is heading

Maxim talked about seeing a growing number of AI-native companies adopting Temporal specifically for agent orchestration. The patterns are consistent: teams start by building agents with simple retry logic, hit reliability walls in production, and then move to durable execution.

He also emphasized that the core challenge is visibility. When an agent is running a complex multi-step workflow, you need to be able to inspect its state, understand what it's done, and intervene if something goes wrong. Temporal's event history gives you that audit trail by default — every decision, every activity execution, every signal is recorded.
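
As a rough picture of what such an audit trail looks like, the sketch below keeps an in-memory, append-only list of events. The `Event` and `EventHistory` types and the three event kinds are assumptions made for this example; Temporal's actual event history has its own event types and is stored by the Temporal server.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class Event:
    event_id: int
    kind: str          # assumed kinds: "decision", "activity", "signal"
    name: str
    detail: Any
    recorded_at: datetime


@dataclass
class EventHistory:
    """Append-only record of what an agent run did, in order."""
    events: list[Event] = field(default_factory=list)

    def record(self, kind: str, name: str, detail: Any = None) -> Event:
        event = Event(
            event_id=len(self.events) + 1,
            kind=kind,
            name=name,
            detail=detail,
            recorded_at=datetime.now(timezone.utc),
        )
        self.events.append(event)
        return event

    def of_kind(self, kind: str) -> list[Event]:
        """Filter the history, e.g. to review only the agent's decisions."""
        return [e for e in self.events if e.kind == kind]

    def summary(self) -> list[str]:
        return [f"{e.event_id}: {e.kind} {e.name}" for e in self.events]


history = EventHistory()
history.record("decision", "choose_tool", {"tool": "search"})
history.record("activity", "call_search_api", {"query": "order 42"})
history.record("signal", "human_approval", {"approved": True})
```

With the history populated this way, `history.of_kind("decision")` returns only the first event, and `history.summary()` lists all three in the order they were recorded.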

For teams building enterprise AI products, this kind of observability isn't optional. When your agent is making decisions that affect real users and real data, you need a complete record of what happened and why.

Durable execution isn't a new concept, but the AI agent wave is making it newly relevant. The failure modes that Maxim and the Temporal team have been solving for years — long-running processes, unreliable dependencies, the need for at-least-once execution with idempotent activities — are exactly the failure modes that agent builders are hitting right now.
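
The pairing of at-least-once execution with idempotent activities can be sketched as follows. This is a simplified, in-memory illustration: `IdempotentActivities`, `charge`, and the key `"order-42-charge"` are hypothetical, and a production system would need the deduplication record to be durable and consistent with the side effect itself, or would pass the key to a downstream service that deduplicates.

```python
from typing import Any, Callable


class IdempotentActivities:
    """Remember results by idempotency key so redelivery has no extra effect."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def execute(self, idempotency_key: str, effect: Callable[[], Any]) -> Any:
        # A repeated delivery with the same key returns the first result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = effect()
        self._results[idempotency_key] = result
        return result


balance = {"cents": 10_000}


def charge(cents: int) -> int:
    """Side effect we must not apply twice: deduct from the balance."""
    balance["cents"] -= cents
    return balance["cents"]


activities = IdempotentActivities()
first = activities.execute("order-42-charge", lambda: charge(2_500))
# Simulated redelivery after a timeout: same key, so no second deduction.
second = activities.execute("order-42-charge", lambda: charge(2_500))
```

Both calls return the same remaining balance and the deduction is applied once, even though the activity was delivered twice.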

If you're building AI agents and haven't looked at workflow orchestration, the conversation is worth your time. The full interview is embedded above.

This interview was recorded at HumanX 2026 in San Francisco.
