Defenses
Key References
- JBShield — Zhang et al. (2024). arXiv:2412.12549. arxiv.org/abs/2412.12549
- Gradient Cuff — Hu et al. (2024). arXiv:2403.00867. arxiv.org/abs/2403.00867
- Progent — Jia et al. (2025). arXiv:2504.11703. arxiv.org/abs/2504.11703
- Llama Guard (StepShield-related) — Inan et al. (2023). arXiv:2312.06674. arxiv.org/abs/2312.06674
This framework provides four defense implementations that operate at different points in the agentic pipeline:
Defense Layers
| Defense | Layer | Mechanism |
|---|---|---|
| JBShield | Prompt | Mutation/divergence detection |
| Gradient Cuff | Prompt | Gradient signal anomaly (local models only) |
| StepShield | Response | Per-response harmfulness thresholding |
| Progent | Tool | Privilege policy and allowlist enforcement |
Composability
Defenses are activated in deterministic registry order and are independently composable:
python run.py \
  --config configs/eval_qwen_pair_attack.yaml \
  --use-defenses jbshield gradient_cuff stepshield progent
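One plausible reading of "deterministic registry order" is that the order of names passed to `--use-defenses` does not matter, because the registry's own ordering decides which defense runs first. The sketch below illustrates that idea only. `REGISTRY_ORDER` and `activate` are hypothetical names, and the order shown is assumed from the table above; neither is taken from the framework's actual `defenses/__init__.py`.

```python
# Hypothetical sketch: NOT the framework's actual registry code.
# The order below is an assumption based on the Defense Layers table.
REGISTRY_ORDER = ("jbshield", "gradient_cuff", "stepshield", "progent")


def activate(requested):
    """Return the requested defense names sorted into registry order."""
    unknown = set(requested) - set(REGISTRY_ORDER)
    if unknown:
        raise ValueError(f"unknown defenses: {sorted(unknown)}")
    wanted = set(requested)
    # Iterate over the registry, not the request, so the result is
    # the same regardless of how the caller ordered the names.
    return [name for name in REGISTRY_ORDER if name in wanted]


print(activate(["progent", "jbshield"]))  # ['jbshield', 'progent']
```

Because each name is selected independently, any subset of the four defenses yields a valid activation list under this sketch.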
Each active defense contributes to the Defense Bypass Rate (DBR) metric. A defense_bypassed=True record means the attack succeeded despite all active defenses.
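As a rough illustration of how such a metric could be aggregated, the sketch below assumes DBR is the fraction of evaluation records whose `defense_bypassed` flag is `True`. That definition, the `defense_bypass_rate` helper, and the dict-based record layout are all assumptions made for this example; the framework's actual computation may use a different denominator or record format.

```python
# Hypothetical sketch: the DBR definition used here is an assumption,
# not the framework's documented formula.
def defense_bypass_rate(records):
    """Fraction of records whose defense_bypassed flag is True."""
    if not records:
        return 0.0
    bypassed = sum(1 for r in records if r.get("defense_bypassed") is True)
    return bypassed / len(records)


# Illustrative records only, not real evaluation results.
records = [
    {"defense_bypassed": True},
    {"defense_bypassed": False},
    {"defense_bypassed": False},
    {"defense_bypassed": True},
]
print(defense_bypass_rate(records))  # 0.5
```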
Implementation Location
- defenses/jbshield.py
- defenses/gradient_cuff.py
- defenses/progent.py
- defenses/stepshield.py
- defenses/__init__.py: registry and activation
→ JBShield
→ Gradient Cuff
→ Progent
→ StepShield