January 14, 2026

Cleric is building an AI that actually understands your production outages

A conversation with William Pienaar, co-founder and CTO of Cleric.

Every engineer knows the feeling. It's 3 AM. An alert fires. You're jolted awake, flat-footed, trying to context-switch into a codebase you haven't touched in weeks. The problem could be anywhere. The clock is ticking.

This is the pain that Cleric is solving. They've built an AI SRE that handles the first stages of incident diagnosis: the initial triage, the root cause analysis, the context-gathering that usually falls on whoever's holding the pager.

William Pienaar, Cleric's co-founder and CTO, comes from a platform engineering background. He's lived this pain firsthand: engineers drowning in alerts, production operations consuming more and more of their time, the paradox where shipping more code means having more to maintain. The more you build, the worse the on-call experience gets.

The sublinear promise

What made 2022 the right moment to start Cleric? William saw two trends moving in opposite directions. Human cognitive load was growing linearly (or worse) with system complexity, while the effort AI needed to absorb that same complexity was scaling sublinearly. Larger context windows. Better reasoning. The ability to absorb and synthesize information faster than any human.

The insight was simple: operations would be automated. The question was just when and how.

What models are good at (and what they're not)

Cleric has spent years learning where LLMs excel and where they stumble. Anything semantic works well: logs, traces, config objects, APIs. Give the model access to a file system where it can stage work and reason about problems, and it can produce genuinely useful analysis.
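To make the "give it a file system" idea concrete, here's a minimal Python sketch of staging an incident's semantic evidence into a scratch workspace the agent can read and annotate. The fetcher callables, file layout, and alert shape are illustrative assumptions, not Cleric's actual implementation.

```python
import json
import tempfile
from pathlib import Path


def stage_incident_workspace(alert, fetch_logs, fetch_traces, fetch_config):
    """Write the semantic evidence for an alert into a scratch directory."""
    # Hypothetical: `alert` is a dict with an "id"; fetchers return strings / JSON-able data.
    workspace = Path(tempfile.mkdtemp(prefix=f"incident-{alert['id']}-"))

    (workspace / "logs.txt").write_text(fetch_logs(alert))
    (workspace / "traces.json").write_text(json.dumps(fetch_traces(alert), indent=2))
    (workspace / "config.yaml").write_text(fetch_config(alert))

    # A notes file gives the agent somewhere to accumulate intermediate reasoning.
    (workspace / "notes.md").write_text("# Working notes\n")
    return workspace
```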

Metrics are harder. You can't just throw raw time series data into a language model. Cleric renders time series visually and uses vision models to interpret the patterns. It works, but it reveals how nascent this space still is.
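Here's a hedged sketch of that render-then-ask approach: matplotlib turns the raw series into an image, and a vision-capable model, represented by a hypothetical `vision_model_describe` callable, is asked to characterize the pattern.

```python
# Sketch: turning a raw time series into an image a vision-capable model can read.
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt


def render_series_as_png(timestamps, values, title="p99 latency (ms)"):
    """Render a metric as a PNG, returned as base64 for a multimodal prompt."""
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(timestamps, values)
    ax.set_title(title)
    ax.set_xlabel("time")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=120)
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode("ascii")


def describe_metric(timestamps, values, vision_model_describe):
    """Ask a vision-capable model to characterize the pattern in the chart."""
    image_b64 = render_series_as_png(timestamps, values)
    prompt = (
        "Describe the dominant pattern in this latency chart: "
        "spike, step change, gradual drift, or periodic oscillation?"
    )
    # `vision_model_describe` stands in for whatever multimodal API is in play.
    return vision_model_describe(prompt=prompt, image_base64=image_b64)
```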

The biggest challenge? Red herrings. Show a model one error log and it latches on like it's found the smoking gun. Models aren't naturally humble. A lot of Cleric's IP is built around keeping agents in line, preventing them from racing to conclusions.

There's also the confidence problem. Ask a model to score its own analysis out of 10, and it'll confidently say 9 or 10 while making obvious reasoning errors. LLMs aren't good at detecting their own mistakes. This is why Cleric brings humans back into the loop quickly rather than letting agents run autonomously for long periods.

The five-minute sweet spot

Early versions of Cleric let the agent investigate for 20 minutes before surfacing results. The thinking was that more autonomy would mean better answers.

The data said otherwise.

Most findings emerge in the first five to six minutes. After that, you hit diminishing returns. More importantly, engineers felt disconnected from a system that ran for 20 minutes and returned a structured report. It was either good or bad, with no middle ground.

Bringing humans back sooner transformed the experience. Now it feels like collaboration. Engineers provide course corrections. The system learns from those corrections. Each incident makes the next investigation better.
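In pseudocode terms, the shift is from one long autonomous run to a time-boxed loop with a human check-in. The sketch below assumes a hypothetical `agent` interface and hand-off hook; the budget mirrors the five-to-six-minute observation above.

```python
import time

INVESTIGATION_BUDGET_SECONDS = 5 * 60  # findings mostly land in the first ~5 minutes


def run_investigation(agent, alert, surface_to_engineer):
    """Run bounded investigation steps, then hand findings back to a human early."""
    findings = []
    deadline = time.monotonic() + INVESTIGATION_BUDGET_SECONDS

    while time.monotonic() < deadline:
        # Hypothetical interface: each step returns a dict of evidence and hypotheses.
        step = agent.investigate_step(alert, findings_so_far=findings)
        findings.append(step)
        if step.get("confident_root_cause"):
            break  # no need to burn the whole budget

    # Surface early and often: the engineer confirms, redirects, or dismisses,
    # and that correction becomes input to the next investigation.
    feedback = surface_to_engineer(alert, findings)
    return findings, feedback
```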

This is a broader lesson about AI agents. Long-horizon autonomous tasks don't work well without grounding. Unlike coding, where you have test suites and compilers to verify correctness, production debugging has no equivalent safety net. You need the human.

Sub-agents and skill accumulation

Context rot is real. If you try to stuff an entire production environment into one context window, the model overloads. Cleric uses sub-agents extensively, isolating context for specific investigative tasks.
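A rough sketch of that isolation, with hypothetical environment accessors and a `spawn_subagent` callable standing in for whatever orchestration is actually used: each sub-agent gets only the slice of evidence relevant to its task.

```python
from concurrent.futures import ThreadPoolExecutor


def investigate_with_subagents(alert, environment, spawn_subagent):
    """Fan out scoped sub-agents, then merge their findings for a lead agent."""
    # Hypothetical accessors: each returns only the slice relevant to one task.
    tasks = {
        "logs": environment.logs_for(alert.service),
        "recent_deploys": environment.deploys_for(alert.service),
        "config_diffs": environment.config_history(alert.service),
    }

    # Each sub-agent sees only its own slice, which keeps individual contexts
    # small instead of stuffing the whole environment into one window.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {
            name: pool.submit(spawn_subagent, task=name, context=data, alert=alert)
            for name, data in tasks.items()
        }
        return {name: future.result() for name, future in futures.items()}
```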

They've also built a system for accumulating tribal knowledge. When an incident happens, Cleric might generate code to investigate a specific pattern. That code gets cached. The next time a similar incident occurs, the agent can rerun the same code and see if it still applies. Sometimes it does. Sometimes the agent decides to regenerate. Either way, skills compound over time.
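A minimal sketch of how such a skill cache could work, keyed on a coarse incident signature. The signature fields and the `generate_investigation_code` helper are illustrative assumptions rather than Cleric's real interface.

```python
import hashlib
import json


class SkillCache:
    """Cache of investigation scripts keyed by a coarse incident signature."""

    def __init__(self):
        self._skills = {}  # signature -> source of a previously generated script

    @staticmethod
    def signature(incident):
        # Hypothetical fields: whatever makes two incidents "look alike".
        key = json.dumps(
            {"service": incident["service"], "alert_type": incident["alert_type"]},
            sort_keys=True,
        )
        return hashlib.sha256(key.encode()).hexdigest()

    def get_or_generate(self, incident, generate_investigation_code):
        sig = self.signature(incident)
        if sig in self._skills:
            # Rerun the cached script first; the agent can still choose to
            # regenerate if the old approach no longer applies.
            return self._skills[sig], "cached"
        code = generate_investigation_code(incident)
        self._skills[sig] = code
        return code, "generated"
```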

This maps to Anthropic's recent research suggesting that having agents write and execute code is often more efficient than repeated tool calls. The latency savings add up, and the results are more predictable.

The autonomy spectrum

Where should AI agents sit on the spectrum from fully autonomous to purely augmenting humans? William's answer is nuanced: it depends on the class of problem.

For tier-three issues where the fix is just scaling up a cluster or rolling back a deployment, engineers are comfortable letting the agent act autonomously. Low risk, obvious solutions.

For complex distributed problems with multi-step diagnostic chains? Engineers want to see the results before anything happens. They don't trust agents on four- or five-step horizon problems. Not yet.

Autonomy won't arrive as a binary switch. It'll emerge gradually, class of problem by class of problem.
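One way to picture that gradual rollout is a policy table mapping problem classes to autonomy levels, with anything unrecognized defaulting to propose-only. The classes and actions below are illustrative, not an actual Cleric policy.

```python
from enum import Enum


class Autonomy(Enum):
    AUTO_EXECUTE = "auto_execute"  # low risk, obvious fix
    PROPOSE_ONLY = "propose_only"  # human reviews before anything happens


# Illustrative mapping, not an actual Cleric policy.
AUTONOMY_POLICY = {
    "scale_up_cluster": Autonomy.AUTO_EXECUTE,
    "rollback_deployment": Autonomy.AUTO_EXECUTE,
    "multi_step_distributed_diagnosis": Autonomy.PROPOSE_ONLY,
}


def route_action(problem_class, proposed_action, execute, propose_to_engineer):
    """Execute simple remediations; surface complex ones for human review."""
    mode = AUTONOMY_POLICY.get(problem_class, Autonomy.PROPOSE_ONLY)  # default to safe
    if mode is Autonomy.AUTO_EXECUTE:
        return execute(proposed_action)
    return propose_to_engineer(proposed_action)
```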

The zeitgeist shift

William's biggest takeaway from re:Invent 2025 was how much the conversation has changed. A year ago, he encountered skepticism about AI's practical value. Now the question isn't whether to adopt AI agents but how to integrate them across the software development lifecycle.

Coding agents broke through the skepticism barrier. They showed enterprise teams what's possible. Now the conversation is about where else these capabilities can be applied.

For Cleric, that means an expanding aperture. Today it's incident diagnosis. Tomorrow it's a broader vision of AI handling the operational burden that keeps engineering teams from building.

This interview was conducted at AWS re:Invent 2025.
