Ameya Bhatawdekar on building AI evaluations at Braintrust
Michael Grinich interviews Ameya Bhatawdekar from Braintrust on AI evaluation, prompt engineering, and building reliable AI products at HumanX 2026.
At HumanX 2026 in San Francisco, Michael Grinich sat down with Ameya Bhatawdekar from Braintrust to talk about one of the hardest problems in shipping AI products: knowing whether they actually work.
The evaluation problem
Building AI features is the easy part; validating them is not. Knowing whether those features are reliable — across thousands of edge cases, user inputs, and real-world conditions — is where teams get stuck.
Ameya Bhatawdekar leads work at Braintrust, a platform focused on AI evaluation and observability. The core challenge he sees: teams ship AI-powered features without a rigorous way to measure whether they're improving or regressing with each change.
Traditional software testing doesn't map cleanly onto AI systems. When your output is probabilistic, deterministic unit tests alone aren't sufficient — the same input can produce different valid outputs. You need evaluation frameworks that account for the inherent variability of model outputs while still giving you confidence that your system is getting better over time.
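One way to accommodate that variability (a minimal sketch, not Braintrust's implementation — the function name and scoring scheme here are illustrative assumptions) is to score outputs against criteria rather than exact strings:

```python
def contains_required_facts(output: str, required: list[str]) -> float:
    """Score an LLM output by criteria coverage instead of exact match.

    Two differently worded answers can both be correct, so we check
    that each required fact appears and return the fraction covered.
    """
    text = output.lower()
    hits = sum(1 for fact in required if fact.lower() in text)
    return hits / len(required)

# Two valid but differently phrased answers score identically.
a = contains_required_facts(
    "The Eiffel Tower, finished in 1889, stands in Paris.",
    ["Eiffel Tower", "1889", "Paris"],
)
b = contains_required_facts(
    "Paris is home to the Eiffel Tower, completed in 1889.",
    ["Eiffel Tower", "1889", "Paris"],
)
# Both score 1.0 despite different wording.
```

Production scorers are usually richer — model-graded rubrics, semantic similarity, human review — but the principle is the same: grade properties of the output, not its exact form.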
What Braintrust is building
Braintrust provides tools for teams to evaluate, iterate on, and monitor their AI applications. The platform lets developers define evaluation criteria, run experiments against datasets, and track how changes to prompts, models, or retrieval pipelines affect output quality.
Ameya's key point: evaluation isn't a one-time gate before deployment. It's a continuous process that runs alongside development. Teams that treat eval as an afterthought end up with AI features that degrade silently in production.
Braintrust's approach includes:
- Eval datasets — curated sets of inputs and expected outputs that represent real usage patterns
- Scoring functions — automated and human-in-the-loop scoring to measure output quality
- Experiment tracking — side-by-side comparison of how changes affect results across your entire eval suite
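The three pieces above compose naturally into a small harness. This is a generic sketch under assumed names (`EvalCase`, `run_eval`, and the stand-in task are hypothetical, not Braintrust's API): a dataset of cases, a task under test, and a scoring function that reduces to one comparable number.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One entry in an eval dataset: an input and its expected output."""
    input: str
    expected: str

def run_eval(
    cases: list[EvalCase],
    task: Callable[[str], str],
    score: Callable[[str, str], float],
) -> float:
    """Run every case through the task and return the mean score,
    giving a single number to compare across experiments."""
    scores = [score(task(c.input), c.expected) for c in cases]
    return sum(scores) / len(scores)

# Hypothetical stand-in for a model or prompt under test.
def task(inp: str) -> str:
    return inp.upper()

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

cases = [
    EvalCase("hi", "HI"),
    EvalCase("ok", "OK"),
    EvalCase("no", "nope"),  # this case fails
]
mean = run_eval(cases, task, exact_match)  # 2 of 3 pass -> ~0.667
```

Swapping in a different prompt or model means swapping `task` and re-running: the side-by-side scores are the experiment comparison.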
Prompt engineering is engineering
One theme that came up in the conversation: prompt engineering is real engineering work, not a hack or a temporary workaround.
Ameya emphasized that the teams getting the best results from AI treat prompt development with the same rigor they'd apply to any other part of their codebase. That means version control, systematic testing, and data-driven iteration — not ad hoc tweaking without measurement.
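In practice that can be as simple as treating prompts as versioned data rather than inline strings. A minimal sketch (the registry layout and names are assumptions, not a prescribed pattern): named, versioned templates that live in the repo and can be diffed, reviewed, and tested like any other code.

```python
# Prompts checked into version control as named, versioned data,
# so every change is diffable and reviewable.
PROMPTS = {
    "summarize-v1": "Summarize the following text:\n\n{text}",
    "summarize-v2": "Summarize the following text in one sentence:\n\n{text}",
}

def render(prompt_id: str, **kwargs: str) -> str:
    """Render a named prompt template; raises KeyError on unknown ids,
    which catches typos and references to deleted versions early."""
    return PROMPTS[prompt_id].format(**kwargs)

rendered = render("summarize-v2", text="Evals matter.")
```

Keeping old versions around also makes it trivial to re-run an eval suite against v1 and v2 and compare scores before switching.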
This aligns with what we see across the industry. The gap between a demo that works and a production system that's reliable is largely filled by evaluation infrastructure. Teams that invest in that infrastructure ship with more confidence because they have data on how changes affect output quality before those changes reach users.
Shipping AI with confidence
Many teams still lack structured evaluation workflows. They ship changes to prompts or swap models without a systematic way to know whether things got better or worse.
Braintrust's bet is that evaluation tooling becomes as fundamental to AI development as CI/CD is to traditional software. Just as you wouldn't merge code without running tests, you shouldn't deploy prompt changes without running evals.
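The CI/CD analogy can be made literal with a gate that compares the current eval score to a recorded baseline. A minimal sketch (the function and tolerance parameter are illustrative assumptions): the pipeline blocks the deploy when quality regresses beyond a small tolerance.

```python
def eval_gate(current_score: float, baseline: float, tolerance: float = 0.01) -> bool:
    """CI-style gate: pass only if the current eval score has not
    regressed more than `tolerance` below the recorded baseline."""
    return current_score >= baseline - tolerance

eval_gate(0.91, baseline=0.90)  # True: improvement passes
eval_gate(0.80, baseline=0.90)  # False: regression blocks the deploy
```

The tolerance absorbs run-to-run noise from nondeterministic outputs; a stricter setup would average several eval runs before gating.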
For teams building AI features today: invest in your evaluation pipeline early. The cost of building it upfront is concrete and bounded; the cost of debugging production regressions without evaluation data compounds over time.
This interview was recorded at HumanX 2026 in San Francisco.