Attack Surfaces¶
The Three-Layer Attack Surface¶
Agentic pipelines expose three distinct attack surfaces, each requiring different mitigations:
1. Input / Prompt Surface¶
The attacker shapes the initial prompt or system context to steer the planner.
- Direct injection: Malicious goal provided as a user message
- Context smuggling: Adversarial payload hidden in an upstream document or retrieved memory chunk
- Role-hijacking: Injected "system" or "assistant" turn fragments that override instructions
Defenses active here: JBShield (mutation/divergence detection), Gradient Cuff (gradient signal)
2. Model Response Surface¶
After the target model generates a response, the response itself may contain harmful content or unauthorized tool-call instructions before defenses execute.
- Jailbreak response: Direct harmful text output
- Encoded harmful calls: Tool invocations that appear benign but carry adversarial payloads
- Incremental Crescendo: Each response moves the model one step further into compliance
Defenses active here: StepShield (per-response harmfulness scoring)
3. Tool Execution Surface¶
Agentic models dispatch tool calls that interact with the real environment:
| Tool | Abuse Example |
|---|---|
file_io |
Read /etc/shadow, write backdoors |
code_exec |
Execute arbitrary shell or Python |
web_browse |
Exfiltrate data, download malware |
network |
Send data to attacker-controlled server |
Defenses active here: Progent policy controls, sandbox isolation, tool allowlists
Attack Strategy vs Surface Coverage¶
| Attack | Prompt | Response | Tool |
|---|---|---|---|
| PAIR | ✅ Primary | ✅ Judged | ✅ Via tool calls |
| Crescendo | ✅ Multi-turn | ✅ Each turn | ✅ Via progressions |
| Prompt Fusion | ✅ Fused payloads | ✅ | ⚠️ Indirect |
| GCG | ✅ Suffix-optimized | ✅ | ⚠️ Indirect |