In this article

May 7, 2026

Securing agentic apps: How to contain AI agent prompt injection

In a chatbot, prompt injection produces a wrong answer. In an agentic system, it produces a wrong action.

Maria Paktiti

May 7, 2026

Explore with AI

Open in ChatGPT

Open in Claude

Open in Perplexity

In June 2025, researchers at Aim Security disclosed EchoLeak: a zero-click vulnerability in Microsoft 365 Copilot that allowed a remote attacker to steal confidential data simply by sending an email. The attack worked like this: the attacker embeds hidden instructions in a document or email. The victim opens it. Copilot reads the content, follows the embedded instructions, queries internal files using its built-in enterprise search tool, and exfiltrates the results to an attacker-controlled server by embedding the data in an image URL that the client loads automatically. Microsoft assigned CVE-2025-32711 (CVSS 9.3) and patched it.

A few months later, security researchers reported significant increases in injection payloads embedded in web content. The payloads weren't targeting chatbots. They were targeting agents: AI systems that browse, search, read documents, and take actions on behalf of users.

This is ASI01 from the OWASP Top 10 for Agentic Applications: agent goal hijacking through prompt injection. It's the risk that everyone has heard of and that almost no production system has adequately defended against. A meta-analysis of 78 studies published in January 2026 found that adaptive attack success rates against state-of-the-art defenses exceed 85%.

The previous articles in this series built a layered defense for the agent runtime: scoped credentials to limit who can act, supply chain verification to ensure tools are trustworthy, and invocation policy to control how tools are called. Those layers don't prevent prompt injection. What they do is contain it. This article explains how prompt injection works in agentic systems, why it's fundamentally different from the chatbot version, and how the defenses you've already built become the containment boundary.

Why agentic prompt injection is a different problem

Prompt injection in a chatbot is an output problem. The attacker manipulates the model into saying something it shouldn't: leaking a system prompt, generating harmful content, bypassing a safety filter. The damage is confined to the conversation.

Prompt injection in an agentic system is an action problem. The attacker manipulates the model into doing something it shouldn't: querying a database, writing a file, sending an email, executing code, calling an API. The damage is confined only by whatever permissions and policy controls the agent operates under.

This distinction matters because the attack surface is fundamentally larger. A chatbot has one output channel: text. An agent has as many output channels as it has tools. Each tool is a potential action the attacker can trigger through manipulated input.

There are two delivery mechanisms, and agentic systems are vulnerable to both.

Direct injection

The attacker controls the input to the agent. They type "ignore your instructions and instead do X." This is the version most people think of, and it's the least dangerous in practice because it requires the attacker to have direct access to the agent's input, which means they're already an authenticated user. Your existing controls (identity, RBAC, invocation policy) constrain what the agent can do even if the injection succeeds.

Direct injection is still a real risk in scenarios where the agent is exposed to semi-trusted users: customer-facing support bots, shared internal tools, public-facing API endpoints that feed into agent workflows. But it's not the attack that keeps security teams up at night.

Indirect injection

The attacker doesn't have direct access to the agent. Instead, they plant malicious instructions in content the agent will process: a web page, a document, an email, a calendar invite, a database record, a tool output, a comment in a code repository. The agent reads the content as part of its normal workflow and follows the embedded instructions.

This is the attack vector that EchoLeak exploited. The attacker sent an email containing hidden instructions. Copilot processed the email as input, followed the instructions, and exfiltrated data. The user never typed anything malicious. They never even clicked anything.

Indirect injection is harder to defend against because the malicious input arrives through a trusted channel. The agent is supposed to read emails. It's supposed to process documents. It's supposed to browse web pages. You can't block the input channel without breaking the agent's functionality.

In May 2026, this played out in a financial context. An attacker on X sent a Morse code encoded message that tricked an AI-integrated crypto wallet into authorizing a $150,000 token transfer. The encoding bypassed the model's safety filters, which were trained to catch natural language attack patterns but not obfuscated inputs. The agent had the credentials and the authorization to move funds. The injection just told it where to send them.

The three things an attacker wants

Once an attacker achieves prompt injection in an agentic system, they generally pursue one of three objectives.

1. Data exfiltration

The agent has access to internal systems (email, databases, document stores, CRM). The attacker's injected instructions tell the agent to query those systems and transmit the results externally. The EchoLeak attack is the canonical example: query enterprise emails, embed the content in a URL, let the client's image loader send it out.

Exfiltration is particularly dangerous because it can be invisible. The agent doesn't announce what it's doing. The data leaves through a side channel (an image URL, a webhook, an API call, an email BCC) that may not appear in the agent's visible output.

This is where the invocation policy from the previous article pays off. Argument validation that restricts email recipients to approved domains, chain analysis that flags read-then-send sequences, and circuit breakers that limit the volume of data a tool can return in a single call all make exfiltration harder. They don't prevent the injection, but they limit what the injected agent can actually accomplish.

2. Unauthorized actions

The agent has access to tools that modify state: creating records, sending communications, deploying code, moving money. The attacker's injected instructions tell the agent to perform actions the user never requested.

The $150,000 crypto transfer is the most dramatic example, but the pattern shows up in subtler ways too. In April 2026, researchers at Pillar Security demonstrated that a prompt injection in Google's Antigravity, an AI developer tool for filesystem operations, could be combined with the tool's permitted file-creation capability to achieve remote code execution. The agent was allowed to create files. The injection just told it to create the right file in the right place.

This is where identity scoping and RBAC from the credentials guide become containment boundaries. If the agent's token only carries tickets:read and tickets:write, a successful injection can't trigger billing operations or deploy code, because the agent doesn't have the credentials for those actions. The injection redirects the agent's intent, but the agent's permissions limit the available actions.

3. Lateral movement to other agents

In multi-agent systems, a successfully injected agent can propagate the attack by passing malicious instructions to other agents it communicates with. The injected agent crafts a message to a downstream agent that contains the same kind of hidden instructions, effectively chaining the injection across the system.

This turns a single injection into a system-wide compromise if inter-agent communication isn't authenticated and validated. The defenses here overlap with ASI07 (inter-agent communication security).

When injection leads to code execution

One of the highest-impact outcomes of prompt injection is unexpected code execution, classified as ASI05 in the OWASP list. When an agent has access to a code interpreter, a shell, or a deployment tool, a successful injection can escalate from data manipulation to arbitrary code execution.

This isn't hypothetical. The Google Antigravity exploit demonstrated exactly this pattern: prompt injection caused the agent to write a malicious file that was then executed by the system. In April 2026, a Cursor AI coding agent running Claude deleted a startup's entire production database and backups in a single API call, nine seconds after receiving an instruction the agent interpreted as legitimate.

Agents with code execution capabilities need additional controls beyond the standard invocation policy:

‍Sandbox everything. Agent-generated code should never run with production credentials or in the production environment. Execute in an isolated container with no network access to internal services, read-only access to necessary files, and resource limits that prevent runaway processes.‍
Validate before executing. For agents that generate and run code, insert a validation step between generation and execution. Check for dangerous patterns: filesystem operations outside the workspace, network calls to unexpected endpoints, database operations that modify or delete data, imports of system-level libraries.

  
const dangerousCodePatterns = [
  /os\.system|subprocess|exec\(/i,
  /DROP\s+TABLE|DELETE\s+FROM|TRUNCATE/i,
  /rm\s+-rf|rmdir|unlink/i,
  /fetch\(|XMLHttpRequest|urllib/i,
  /eval\(|Function\(/i,
  /import\s+shutil|import\s+pathlib/i,
];

function validateGeneratedCode(code: string): PolicyResult {
  const flags: string[] = [];
  for (const pattern of dangerousCodePatterns) {
    if (pattern.test(code)) {
      flags.push(`Dangerous pattern detected: ${pattern.source}`);
    }
  }
  if (flags.length > 0) {
    return { allowed: false, reason: flags.join('; ') };
  }
  return { allowed: true };
}

‍Require human approval for destructive operations. Any code that writes, deletes, deploys, or modifies infrastructure should be presented to a human before execution. The agent can generate the code. A human decides whether it runs.

Defense in depth: How the series layers contain injection

No single defense reliably prevents prompt injection. The meta-analysis finding of 85%+ attack success rates against state-of-the-art defenses should be taken seriously. Input filtering, prompt hardening, and model-level guardrails all help but none are sufficient on their own.

The practical strategy is defense in depth: assume injection will succeed and make sure the blast radius is contained. This is where the previous articles in the series become a coherent defense:

‍Identity scoping (ASI03) limits available actions. A successfully injected agent can only call tools its token permits and access resources its FGA assignment covers. If the support agent's token carries tickets:read and tickets:write but not billing:refund or admin:export, the attacker's options are constrained to the ticket system regardless of what the injection says.‍
Supply chain verification (ASI04) limits tool trust. If your MCP servers are authenticated and their tool manifests are pinned, the injected agent can't connect to a new, attacker-controlled server or use tools that weren't in the approved manifest.‍
Invocation policy (ASI02) limits tool usage. Argument validation catches dangerous parameters. Chain analysis catches suspicious sequences. Circuit breakers halt runaway activity. Even if the agent's intent has been hijacked, the policy layer evaluates every tool call against the same rules.

Together, these layers create a system where prompt injection is still possible but the damage is bounded. The attacker can redirect the agent's goal, but they can't exceed the agent's permissions, connect to unapproved tools, pass dangerous arguments, or execute suspicious tool chains without triggering a policy check.

Input-level defenses still matter

Containment is the primary strategy, but that doesn't mean you should skip input-level defenses. They raise the cost of attack and catch unsophisticated attempts.

‍Prompt structure and instruction hierarchy. Separate system instructions from user input and retrieved content. Use clear delimiters that the model is trained to respect. Place the most important behavioral constraints at both the beginning and the end of the system prompt, since models attend more to those positions.‍
Input scanning. Scan content entering the agent's context for patterns that look like instructions rather than data. The tool description scanner from the supply chain article applies here too: content that contains phrases like "ignore previous instructions," "override your system prompt," or encoded variants of these should be flagged.‍
Content isolation. Treat content from different trust levels differently. User input, retrieved documents, tool outputs, and inter-agent messages should be tagged with their source, and the model should be instructed to treat retrieved content as data, not as instructions. This is imperfect (the model ultimately processes everything as tokens), but it adds friction to indirect injection.‍
Output filtering. Before the agent's tool calls are executed, check whether the action aligns with the user's original request. If the user asked for a ticket summary and the agent is attempting to send an email, that's a signal worth investigating regardless of whether injection occurred.

What you can do today

If you've followed this series, you already have most of the infrastructure:

Reduce blast radius. Scope agent credentials to the minimum required permissions. Use FGA to enforce resource boundaries. Short-lived tokens limit the window of exposure.
Enforce invocation policy. Argument validation, chain analysis, and circuit breakers catch the actions that injection tries to trigger. Require human approval for high-risk operations.
Isolate untrusted content. Tag content by source. Scan for injection patterns before it enters the agent's context. Treat every external document, email, and web page as potentially adversarial.
Sandbox code execution. Never let agent-generated code run in production. Validate before executing. Require human approval for anything destructive.
Monitor and respond. Audit logs capture every tool invocation. Behavioral baselines detect when an agent's tool usage pattern shifts. When an anomaly is detected, halt the agent and investigate.

Prompt injection is not a problem you solve once. It's an ongoing adversarial dynamic where attackers find new encoding tricks, new delivery channels, and new ways to phrase instructions that bypass filters. The defenses described in this series don't eliminate the risk. They reduce the blast radius to the point where a successful injection is a contained incident rather than a system-wide compromise.

Securing AI agents and MCP servers with WorkOS

The containment strategy described in this article depends on having the right identity and authorization infrastructure underneath it. WorkOS provides that foundation: OAuth 2.1 for scoped agent credentials, RBAC and FGA for resource-level access control, audit logging for every tool invocation, and native MCP server authentication. When prompt injection redirects your agent's intent, these are the layers that limit what the agent can actually do.

We’re hiring

Our global team is growing and we’re hiring all types of roles.

View open roles

About us

WorkOS builds developer tools for quickly adding enterprise features to applications.

Learn more