April 15, 2026

Software still does things we don't expect

Honeycomb CEO Christine Yen on why observability matters more than ever as AI agents reshape how software gets built, debugged, and understood.

Honeycomb has always been about helping engineering teams figure out why their code isn't doing what they expect. As co-founder and CEO Christine Yen told WorkOS CEO Michael Grinich at HumanX 2026, that mission is more relevant than ever—even as everything about how software gets built is changing.

"Teams sometimes now are a human and their agent, or multiple humans and multiple agents," Yen said. "That requires coordination and communication—and software still does things we don't expect."

When code is no longer the source of truth

As AI-generated code proliferates and codebases change faster than anyone can read, the code itself is no longer the authoritative reference for understanding a system.

Engineers used to debug by cross-referencing telemetry with the codebase. You'd see something unexpected in production, pull up the relevant module, and trace the logic. That workflow is breaking down.

Yen pointed to a customer that doubled the amount of custom code in their system over just two years of using coding agents. That's a doubling in surface area and change velocity—and the old habit of "just read the code" doesn't scale when code is being generated faster than engineers can review it.

Evals are tests. Observability is production.

When Grinich asked about the relationship between evals and observability, Yen drew a clear line: evals are the tests of the AI world—intentional, structured exercises you work through during development. But just like traditional software tests, they're not enough on their own.

"You've got to be able to predict what should happen and then actually see what's happening in reality," she said. "If you see something in production that isn't what you intended, you'd better feed that back into your evals."

Evals tell you whether your system behaves correctly under conditions you've anticipated. Observability tells you what's actually happening when real users hit real edge cases you didn't anticipate. The feedback loop between them—where production surprises become new eval cases—is what makes AI systems improvable over time.
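That feedback loop can be sketched in a few lines. This is a hypothetical illustration of the pattern, not Honeycomb's or any vendor's API: when production output diverges from what was intended, the surprise is captured as a new eval case so it can never regress silently again.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One anticipated condition: a prompt and the intended output."""
    prompt: str
    expected: str

@dataclass
class EvalSuite:
    cases: list = field(default_factory=list)

    def add_from_production(self, prompt: str, actual: str, intended: str) -> None:
        # A production surprise: the system's actual behavior diverged
        # from what we intended, so it becomes a regression case.
        if actual != intended:
            self.cases.append(EvalCase(prompt=prompt, expected=intended))

suite = EvalSuite()
# Observed in production: user asked for a cat photo, got a dog photo.
suite.add_from_production(
    prompt="show me a cat photo",
    actual="dog_photo.jpg",
    intended="cat_photo.jpg",
)
assert len(suite.cases) == 1  # the surprise now lives in the eval suite
```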

Defining "good" for non-deterministic systems

For AI-powered applications with non-deterministic behavior, observability requires encoding subjective quality measures—and that's harder than it sounds. You can't just check for a 200 status code and call it a day.

Yen highlighted Duolingo as a Honeycomb customer that defines quality in terms of whether anything could put a user's streak at risk. That's a business-specific metric, not a generic infrastructure one, and it allows them to detect and respond to deviations that directly affect user retention and engagement.
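A minimal sketch of what encoding that kind of business-specific signal might look like. The field names and thresholds here are purely illustrative assumptions, not Duolingo's or Honeycomb's actual schema; the point is that the quality check lives alongside generic infrastructure telemetry, so a request can "succeed" by HTTP standards and still count as a problem.

```python
def streak_at_risk(event: dict) -> bool:
    """Hypothetical check: a lesson-completion event threatens a user's
    streak if the request failed, or succeeded so slowly that the user
    may have given up before finishing."""
    return event["status_code"] != 200 or event["duration_ms"] > 5000

def annotate_span(span_attrs: dict, event: dict) -> dict:
    # Attach the business metric next to generic infrastructure fields,
    # so queries can slice by "did this change threaten streaks?"
    span_attrs["app.streak_at_risk"] = streak_at_risk(event)
    return span_attrs

attrs = annotate_span(
    {"service.name": "lesson-api"},
    {"status_code": 200, "duration_ms": 7200},  # a 200 can still be "bad"
)
# attrs["app.streak_at_risk"] is True: a bare 200-status check would miss it.
```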

"How do I know whether some change in my system matters? How do I map it to business impact?" Yen said. "Those conversations are going to be had more and more."

The wild west is where observability flourishes

Looking ahead, Yen is most excited about the complexity. More code being shipped, more non-deterministic systems, more surface area to debug. The harder it gets to understand what software is doing, the more valuable observability becomes.

"The wild west is what excites me," she said. "You need to debug why someone asked for a cat photo and got a dog photo. These are variations on the same challenge—we just have to figure out how to scale it."

The core problem Honeycomb was built to solve—understanding complex, distributed systems in production—hasn't changed. AI-generated code and non-deterministic behavior are just making every team's systems complex enough to need it.

This interview was recorded at HumanX 2026 in San Francisco.
