In this article

July 23, 2025

Enterprise AI Agent Playbook: What Anthropic and OpenAI Reveal About Building Production-Ready Systems

While most companies struggle with AI proof-of-concepts, industry leaders have quietly published the playbooks behind these enterprise successes. Here's what separates Enterprise Ready AI agents from expensive experiements.

Zack Proser

July 23, 2025

Morgan Stanley's 98% AI adoption rate. Klarna's $40 million profit improvement. BBVA's 2,900 custom agent deployments in five months.

While most companies struggle with AI proof-of-concepts, industry leaders have quietly published the playbooks behind these enterprise successes.

Anthropic distilled lessons from dozens of customer implementations in theirs, while OpenAI documented deployment strategies from seven frontier companies.

Here's what separates production-ready AI agents from expensive experiments.

The Enterprise Reality Check

Unlike conventional software that enables users to streamline and automate workflows, agents are able to perform the same workflows on the users' behalf with a high degree of independence. But most enterprise teams are building agents wrong.

The Common Mistakes:

Starting with complex multi-agent architectures
Skipping systematic evaluation frameworks
Ignoring security until production
Building agents for tasks better suited to traditional automation

What Works Instead: Both Anthropic and OpenAI converge on specific patterns.

The most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.

The Three-Signal Framework for Agent Use Cases

Both companies identify identical patterns where agents deliver measurable value. So what are the signals you need to look for?

Signal 1: Complex Decision-Making

The Pattern: Workflows involving nuanced judgment, exceptions, or context-sensitive decisions.

Enterprise Example: Klarna's customer service agent handles two-thirds of all service chats—doing the work of hundreds of agents and cutting average resolution times from 11 minutes to just 2.

The Test: Does your workflow require judgment calls that vary based on context? If humans frequently make different decisions with the same input data, you need an agent.

Signal 2: Rule System Maintenance Hell

The Pattern: Systems that have become unwieldy due to extensive and intricate rulesets, making updates costly or error-prone

Enterprise Example: BBVA's credit risk team uses AI to determine creditworthiness faster and more accurately than their previous rule-based system.

The Test: Do rule changes require developer time? Do edge cases keep breaking your automation? You need an agent.

Signal 3: Unstructured Data Dominance

The Pattern: Scenarios that involve interpreting natural language, extracting meaning from documents, or interacting with users conversationally

Enterprise Example: Lowe's improved product tagging accuracy by 20% and error detection by 60% by fine-tuning models on their unstructured product catalog data.

Architecture Patterns That Actually Scale

The Foundation: Three Components

Both companies agree on the fundamental architecture:

The Three Components:

Model - The LLM powering reasoning and decision-making
Tools - External functions or APIs for taking action
Instructions - Explicit guidelines defining behavior

Anthropic's Composable Patterns: The Production Playbook

Anthropic tested these patterns across dozens of customer implementations and distilled five proven approaches. Each pattern solves specific enterprise challenges:

Pattern 1: Prompt Chaining - Breaking Complex Tasks Into Steps

What It Does: Decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks on any intermediate steps to ensure that the process is still on track.

When To Use: This workflow is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks. The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.

Real Enterprise Examples:

Generating Marketing copy, then translating it into a different language
Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline

Pattern 2: Routing - Intelligent Task Classification

What It Does: Routing classifies an input and directs it to a specialized followup task. This workflow allows for separation of concerns, and building more specialized prompts. Critically, without this workflow, optimizing for one kind of input can hurt performance on other inputs.

When To Use: Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.

Real Enterprise Examples:

Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools
Routing easy/common questions to smaller models like Claude 3.5 Haiku and hard/unusual questions to more capable models like Claude 3.5 Sonnet to optimize cost and speed

Pattern 3: Parallelization - Speed and Confidence Through Parallel Processing

What It Does: LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. Two key variations:

Sectioning: Breaking a task into independent subtasks run in parallel
Voting: Running the same task multiple times to get diverse outputs

When To Use: Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results. For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.

Real Enterprise Examples:

Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response
Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model's performance on a given prompt

Voting Applications:

Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem
Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives

Pattern 4: Orchestrator-Workers - Dynamic Task Delegation

What It Does: In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.

When To Use: This workflow is well-suited for complex tasks where you can't predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task). Whereas it's topographically similar, the key difference from parallelization is its flexibility—subtasks aren't pre-defined, but determined by the orchestrator based on the specific input.

Real Enterprise Examples:

Coding products that make complex changes to multiple files each time
Search tasks that involve gathering and analyzing information from multiple sources for possible relevant information

Anthropic's Production Implementation: Their coding Agent to resolve SWE-bench tasks, which involve edits to many files based on a task description demonstrates this pattern solving real-world software engineering problems.

Pattern 5: Evaluator-Optimizer - Iterative Quality Improvement

What It Does: In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.

When To Use: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.

Real Enterprise Examples:

Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques
Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted

The Progressive Complexity Principle

Anthropic's key insight: These building blocks aren't prescriptive. They're common patterns that developers can shape and combine to fit different use cases.

The key to success, as with any LLM features, is measuring performance and iterating on implementations. To repeat: you should consider adding complexity only when it demonstrably improves outcomes.

Enterprise Security Architecture: Beyond Basic Guardrails

Enterprise AI security requires a fundamental shift from traditional application security models. As OpenAI's enterprise guide reveals, successful deployments implement security as a layered defense mechanism rather than relying on single-point protection.

The reality facing enterprise teams is stark: agents don't just process data—they take actions across multiple systems with varying levels of risk. This creates an entirely new attack surface that traditional security measures weren't designed to handle.

The Multi-Layer Defense Framework

OpenAI's production deployments combine three distinct protection layers, each serving a specific function in the security ecosystem. LLM-based guardrails handle complex reasoning about context and intent—detecting sophisticated prompt injections that attempt to manipulate agent behavior.

Rules-based protections like regex patterns and character limits catch known attack vectors and prevent basic exploitation attempts. The OpenAI Moderation API provides an additional content safety layer, flagging harmful or inappropriate inputs before they reach the core agent logic.

This approach recognizes a crucial truth: while a single guardrail is unlikely to provide sufficient protection, multiple specialized guardrails create genuinely resilient systems.

Tool Risk Assessment: The Production Reality

The most overlooked aspect of agent security isn't the model—it's the tools. OpenAI's framework categorizes every tool by risk profile, creating clear operational boundaries for agent behavior.

Data Tools represent the lowest risk category. These query-only capabilities—reading transaction databases, parsing PDF documents, or searching the web—provide information without system modification. However, even read access requires careful scoping to prevent data exposure.

Action Tools introduce significant security considerations. Capabilities like sending emails, updating CRM records, or escalating customer service tickets create permanent changes in business systems. Each action tool requires specific authorization controls and audit trails.

Orchestration Tools represent the highest complexity tier. When agents serve as tools for other agents—like refund agents, research agents, or writing agents—the security model must account for cascading permissions and compound risk factors.

Implementing Production Guardrails

The guardrail types that matter in production go far beyond content filtering. Relevance classifiers ensure agents stay within intended operational scope—preventing customer service agents from attempting financial transactions, for example. Safety classifiers detect sophisticated attacks like role-playing attempts to extract system instructions or indirect prompt injections hidden in user content.

PII filters prevent unnecessary exposure of personally identifiable information by scanning both inputs and outputs for sensitive data patterns. Tool safeguards assess each available function by risk rating—low, medium, or high—based on factors like read-only versus write access, reversibility of actions, required permissions, and potential financial impact.

The most sophisticated deployments implement contextual guardrails that consider not just individual inputs, but patterns of behavior across agent sessions. This catches attack vectors that traditional per-request filtering misses entirely.

Model Context Protocol: The Enterprise Integration Layer

The Model Context Protocol represents more than a technical specification—it's the foundation for secure, scalable agent-to-system integration in enterprise environments. While both Anthropic and OpenAI focus on agent architectures, MCP addresses the critical gap between agent decision-making and enterprise system access.

The Enterprise Authentication Challenge

Production agent deployments face a fundamental problem that neither Anthropic nor OpenAI fully addresses in their frameworks: how do you securely authenticate agents across multiple enterprise systems while maintaining audit trails and access controls?

Traditional OAuth flows weren't designed for autonomous systems making decisions across extended time periods. Agents might need to access customer data, update financial records, and trigger workflow automation—all within a single business process that spans hours or days.

MCP Security in Production Context

The Model Context Protocol becomes critical precisely where the agent frameworks end. When an agent needs to execute OpenAI's "Action Tools" or access Anthropic's "augmented LLM" capabilities, MCP provides the secure bridge between agent decisions and enterprise systems.

The security considerations multiply in production environments. Agent Authentication requires verifying not just user identity, but agent identity and delegation authority. Tool Authorization must control which agents can access which enterprise systems, with granular permissions that reflect business logic rather than technical capabilities.

Session Management becomes particularly complex when agents maintain long-running processes across multiple enterprise systems. Traditional session timeouts conflict with agent workflows that might legitimately span extended periods. Audit Requirements demand complete traceability of agent-to-system interactions for compliance, security, and business intelligence purposes.

Real-World Implementation Patterns

Enterprise teams implementing MCP security discover that theoretical frameworks meet practical complexity. Basic MCP server security provides the foundation, but enterprise-grade deployments require OAuth-based authorization flows that integrate with existing identity providers and access control systems.

The most sophisticated implementations integrate MCP security with infrastructure providers like Cloudflare, creating unified security policies that span both human and agent access patterns. This approach recognizes that agents aren't isolated systems—they're participants in broader enterprise security ecosystems:

Basic MCP Server Security: For teams getting started, WorkOS's MCP documentation server lets you get help with WorkOS implementation and answers questions about the service and capabilities.

OAuth-Based MCP Authorization: Enterprise teams need secure MCP servers. MCP authorization in 5 easy OAuth specs provides the framework for implementing enterprise-grade access control.

Cloudflare Integration: For teams using Cloudflare infrastructure, WorkOS Cloudflare MCP auth for agentic AI shows how to integrate MCP security with existing enterprise infrastructure and deploy globally in minutes.

Fastest Implementation Path: Development teams can quickly secure MCP tools using XMCP + AuthKit, which provides the fastest path to enterprise-ready MCP server security.

For teams shipping on Vercel: WorkOS built an MCP server using Vercel's own MCP adapter with AuthKit as the identity and auth layer. You can add your own tools and deploy in minutes.

The Fastest Path to Production

For development teams, the path from agent prototype to enterprise deployment often bottlenecks on security integration. The combination of XMCP and enterprise authentication platforms provides the fastest route to production-ready MCP server security, enabling teams to focus on agent logic rather than security infrastructure.

This approach bridges the gap between the agent patterns documented by Anthropic and OpenAI and the operational realities of enterprise deployment—where security, compliance, and auditability determine success more than algorithmic sophistication.

Conclusion: The Agent-Ready Enterprise Advantage

The research is clear: success isn't about building the most sophisticated system. It's about building the right system for your needs.

Companies like Morgan Stanley, Klarna, and BBVA didn't succeed because they built the most advanced AI agents. They succeeded because they:

Started with proven use case patterns
Built on enterprise-grade security foundations
Empowered domain experts rather than centralizing development
Implemented systematic evaluation and iteration cycles

The infrastructure exists. The patterns are proven. The question for enterprise leaders is no longer whether to build AI agents, but how quickly they can implement the frameworks that separate production success from experimental failure.

Sources:

Anthropic: "Building Effective AI Agents" - https://www.anthropic.com/engineering/building-effective-agents
OpenAI: "A Practical Guide to Building Agents" - https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf
OpenAI: "AI in the Enterprise" - https://cdn.openai.com/business-guides-and-resources/ai-in-the-enterprise.pdf

‍

We’re hiring

Our global team is growing and we’re hiring all types of roles.

View open roles

About us

WorkOS builds developer tools for quickly adding enterprise features to applications.

Learn more