System-Level Agent Defenses: Why Indirect Prompt Injection Needs Plan and Policy Boundaries
This paper argues that indirect prompt injection is not only a prompt-filtering problem. General-purpose AI agents need explicit plan, policy, approval, execution, and feedback boundaries so untrusted emails, webpages, or tool outputs cannot silently rewrite what the agent is allowed to do. AIDEFEND already covers much of that system architecture through authority envelopes, dynamic capability scoping, policy enforcement, constrained model judges, data-flow sink enforcement, HITL control points, and agentic security benchmarking.
Threat Analysis
- The attack starts when untrusted data becomes guidance. An attacker hides instructions in an email, webpage, document, or tool output. The agent reads it during a legitimate task, and the malicious text tries to steer the next plan, policy update, or tool call.
- Replanning helps utility but opens a boundary. Agents must adapt to deprecated APIs, test failures, or new evidence. The risk is that attacker-controlled feedback can influence the plan and policy, not just the model's next sentence.
- LLM security judges need a constrained role. The paper allows model-based judgment for context-dependent cases, but only over narrow structured artifacts such as typed traces or proposed plan/policy diffs.
- Human review has to be designed. Ambiguous cases, such as urgent email criteria or risky package installation, need explicit checkpoints, evidence, and authority rather than vague fallback language.
- Static benchmarks can overstate safety. Real tests need long tasks, replanning, policy updates, parameter-level attacks, and adaptive payloads that evolve against the defense.
Applicable AIDEFEND Defenses (9)
What Defenders Should Do Now
- Make the agent's plan and policy explicit artifacts. For each agent workflow, document what the plan says the agent may do, what the policy allows, who can approve changes, and which runtime component enforces the decision.
- Define an authority envelope before execution: allowed tools, data classes, environments, side effects, budgets, and delegation depth. Turn that envelope into signed per-session capability scope rather than a broad static tool allowlist.
- Route untrusted environmental feedback through a quarantine path. Emails, webpages, retrieved documents, and tool outputs should be transformed into typed traces, summaries, or plan/policy diffs before a privileged model or validator sees them.
- Re-check authorization at every sensitive step, especially after replanning. A tool call that reads files, writes code, installs packages, sends money, exports data, updates memory, or changes infrastructure should be evaluated against the current task, policy, delegation chain, and risk class.
- Add human checkpoints for ambiguous or high-impact decisions. The UI should show the proposed action, evidence, policy change, data involved, and rollback path so a reviewer can make a real decision instead of rubber-stamping a vague approval prompt.
- Upgrade agent security tests. Add long-running tasks, runtime failures, parameter-level attacks, adaptive prompt-injection payloads, and policy-update attempts to regression tests, then fail releases when those scenarios bypass the system boundaries above.
Conclusion
The paper is useful because it reframes indirect prompt injection as a system architecture problem. The model is still important, but the durable defense lives around it: authority boundaries, policy engines, constrained model judges, data-flow enforcement, human checkpoints, and benchmarks that exercise dynamic agent behavior. AIDEFEND already has concrete techniques and subtechniques for most of those pieces, which makes the framework useful as a checklist for turning the paper's architecture into deployable controls.