Microsoft Red Team Warns Agentic AI Systems Can Bypass Human Approval Controls


Microsoft says a year of red teaming against deployed agentic AI systems revealed attack chains that can bypass human-in-the-loop controls and reach high-impact outcomes such as data exfiltration or lateral movement. The finding shows that approval prompts alone cannot protect AI agents that can plan, call tools, remember context, and act across multiple systems.

The company detailed the findings in a June 4 update to its agentic AI failure modes research. Microsoft said its updated taxonomy now reflects 12 months of red team operations against real agentic deployments, not only theoretical risks.

The most serious finding is that human review can fail when attackers break a harmful objective into smaller steps, contaminate session context early, or trigger repeated low-risk approval requests until the final harmful action no longer looks unusual.

Why Agentic AI Changes The Security Model

Agentic AI systems do more than answer prompts. They can plan tasks, call tools, read files, use plugins, remember prior context, interact with other agents, and take actions across external services.

That autonomy creates new failure modes. A traditional chatbot may leak information in one response, but an agent can retrieve data, call an API, update a ticket, send a message, or invoke another tool as part of a longer workflow.

Microsoft’s original Taxonomy of Failure Modes in Agentic AI Systems defined risks such as memory poisoning, cross-domain prompt injection, tool compromise, incorrect permissions, and human-in-the-loop bypass. The new update expands that model with seven additional categories based on real-world testing.

Key Findings At A Glance

Main issueHuman approval controls can fail in compound agentic attack chains
Research sourceMicrosoft AI Red Team operational engagements
Time period covered12 months of deployed agentic system testing
Major findingZero-click end-to-end chains occurred after the initial agent invocation
High-impact outcomesData exfiltration, lateral movement, and broader system compromise paths
New taxonomy categoriesSeven new failure modes added in version 2.0
Recommended focusSBOMs, agent identity, approval UX hardening, and full-session behavioral monitoring

How Human-in-the-Loop Bypass Happens

Human-in-the-loop controls ask a person to approve sensitive actions before an AI agent completes them. In theory, this should stop an agent from taking dangerous steps without oversight.

In practice, Microsoft says red teamers repeatedly found ways around those controls. Some chains relied on consent fatigue, where users receive enough approval prompts that they stop reviewing them carefully. Others relied on incremental escalation, where no single action looks dangerous, but the combined result creates a harmful outcome.

The updated Microsoft AI Red Team analysis also found zero-click end-to-end chains that started from external input and required no human interaction beyond the original agent launch. That makes system-level testing more important than model-level evaluation alone.

The Seven New Failure Modes

Microsoft’s version 2.0 taxonomy adds seven categories that reflect what red teams saw in deployed systems. These categories focus on the ways modern agents use tools, share context, interact with other agents, and expose internal capabilities.

  • Agentic supply chain compromise.
  • Goal hijacking.
  • Inter-agent trust escalation.
  • Computer-use agent visual attacks.
  • Session context contamination.
  • MCP and plugin abuse.
  • Capability or architecture disclosure.

Several of these risks overlap in real attacks. A malicious plugin can poison context, context poisoning can change agent reasoning, and capability disclosure can reveal which tool or approval path an attacker should target next.

MCP And Plugin Abuse Became A Major Attack Surface

The Model Context Protocol has become a common way to connect AI agents to tools and data sources. That standardization helps developers build richer agents, but it also gives attackers a shared surface to study.

Invariant Labs previously described tool poisoning attacks, where malicious instructions hide inside MCP tool descriptions. The user may see a normal tool label, while the AI model receives hidden instructions that can redirect behavior or leak sensitive files.

That gap between what the user sees and what the model reads is central to agentic AI risk. If a tool description, plugin, prompt template, or retrieved document can quietly shape the agent’s reasoning, a normal approval prompt may not show the real intent of the next action.

OpenClaw Shows The Supply Chain Risk

Microsoft’s update points to OpenClaw as an example of how quickly agentic frameworks can grow before security catches up. The company said OpenClaw gained more than 336,000 GitHub stars and spawned more than 2,100 agents within 48 hours of launching in January 2026.

One documented OpenClaw issue, GHSA-g8p2-7wf7-98mq, involved a one-click remote code execution path through authentication token exfiltration from a gatewayUrl parameter. GitHub lists affected versions up to 2026.1.28 and a patched version of 2026.1.29.

For enterprises, this is the same lesson security teams learned from open-source software supply chains, but with an extra layer. Agentic systems can ingest not only code, but also natural-language instructions from tools, registries, prompts, and connected services.

Session Context Contamination Is Hard To Spot

Session context contamination happens when an attacker introduces misleading or malicious content early in an agent session, then benefits from that influence later. The early content may look harmless on its own.

This becomes dangerous because agentic workflows often run across many steps. A contaminated instruction can sit inside accumulated context, influence later decisions, and shape how the agent interprets tools, approvals, or objectives.

The original Microsoft taxonomy whitepaper already warned that indirect prompt injection becomes more impactful when agents have more autonomy. The new update adds stronger emphasis on tracking context provenance and separating trusted system context from untrusted retrieved content.

Why Approval Prompts Need Better Design

A simple approval prompt may not show enough detail to protect users. If the agent summarizes its own request, an attacker may be able to make the action sound safer than it is.

Microsoft recommends approval prompts based on the underlying tool calls, not only the agent’s description of the action. Organizations should also tier approvals based on reversibility, blast radius, and sensitivity.

For example, reading a public document should not need the same review level as sending files outside the company or changing access permissions. Approval prompts should also detect unusual frequency patterns that may signal consent fatigue.

Organizations deploying agents should treat agentic components like production software and production identities. That means inventory, authentication, authorization, monitoring, and incident response should apply from the start.

  • Generate an SBOM for each deployed agent, including plugins, MCP servers, prompt templates, tool descriptions, and dependencies.
  • Pin versions for MCP servers, plugins, and tool definitions.
  • Scan tool descriptions for hidden or changed instructions.
  • Verify agent identities cryptographically instead of trusting workflow position.
  • Require verifiable identity claims for inter-agent messages.
  • Build approval prompts from real tool calls and target resources.
  • Monitor approval frequency, escalation patterns, and full-session behavior.
  • Separate trusted system context from untrusted external content.

The Cloud Security Alliance’s Agentic MCP Security Best Practices Guide makes a similar point: MCP security requires attention across the full tool invocation lifecycle, from tool description and authentication to execution and returned results.

Why Full-System Red Teaming Matters

Many AI safety tests focus on single prompts and single responses. Agentic systems need deeper testing because the dangerous outcome may not appear until several steps later.

Red teams should test how agents behave when a poisoned document enters retrieval, a malicious MCP server joins the environment, a tool changes description, a sub-agent claims a trusted role, or an external page contains hidden instructions.

The OpenClaw security advisory also shows why browser-mediated, token-based, and local-gateway designs need careful review. A system can appear safe because it runs locally, yet still expose privileged control through a user-initiated connection.

What Enterprises Should Do This Quarter

Security teams should start with inventory. They need to know which agents exist, which tools they can call, which data they can access, and which approvals protect sensitive actions.

They should also update AI red team plans. Testing should include goal hijacking, MCP tool poisoning, inter-agent trust escalation, session contamination, approval bypass, and capability disclosure.

The Cloud Security Alliance’s MCP security guidance recommends isolation for tool execution, dependency monitoring, SBOM generation, and stronger lifecycle controls for MCP deployments. Those controls now belong in enterprise AI governance, not only developer security checklists.

Agentic AI Needs Zero-Trust Security

The lesson from Microsoft’s red teaming work is not that agentic AI should be avoided. It is that autonomous systems need security controls built around how they actually work.

Agents read instructions, store context, call tools, coordinate with other agents, and act across business systems. Each of those steps needs clear trust boundaries.

Tool poisoning research from Invariant Labs and Microsoft’s latest taxonomy update point to the same conclusion: users often cannot see the full instruction stream influencing the agent. Defenders need controls that inspect tools, context, actions, identities, and outcomes across the full workflow.

FAQ

What is human-in-the-loop bypass in agentic AI?

Human-in-the-loop bypass happens when an attacker avoids or manipulates approval controls that should require a person to approve sensitive agent actions. Microsoft says red teamers achieved this through consent fatigue, incremental escalation, and compound action chains.

What did Microsoft find during agentic AI red teaming?

Microsoft found that agentic systems can fail through supply chain compromise, goal hijacking, inter-agent trust escalation, computer-use visual attacks, session context contamination, MCP and plugin abuse, and capability disclosure. It also observed zero-click end-to-end chains after initial agent invocation.

Why are MCP tools risky for AI agents?

MCP tools are risky because agents read tool descriptions and may treat them as trusted instructions. A malicious or compromised tool can hide instructions that steer the agent toward data theft, unauthorized actions, or misuse of other trusted tools.

What is session context contamination?

Session context contamination occurs when untrusted data introduced early in an agent session quietly affects the agent’s later reasoning. Each individual step may look normal, but the full session can still produce a harmful outcome.

How can organizations secure agentic AI systems?

Organizations should inventory agents and tools, generate SBOMs, pin versions, scan tool descriptions, verify agent identity cryptographically, harden approval prompts, isolate tool execution, and monitor full-session behavior instead of only single prompts.

Readers help support VPNCentral. We may get a commission if you buy through our links. Tooltip Icon

Read our disclosure page to find out how can you help VPNCentral sustain the editorial team Read more

User forum

0 messages