
# Agentic Safety Evaluation Framework

> **Research-first documentation:** this site foregrounds the threat model, attack taxonomy, defense mechanisms, and benchmark results before operational setup. If you just want to run an experiment, jump to Quickstart.

## What This Framework Evaluates

Agentic LLMs are qualitatively different from single-turn chat models. They plan across many steps, call external tools, browse the web, execute code, and interact with other agents. A request that a chat model safely refuses in one turn may succeed after three carefully crafted turns with tool-use context.

This framework provides a repeatable evaluation harness that tests jailbreak attacks across the full agentic pipeline — from initial prompt to tool execution.
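To make the shape of such a harness concrete, here is a minimal sketch of a single evaluation run: one attack method paired with one target model, with a per-prompt judge verdict recorded for later aggregation. All class and field names below are illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch of an evaluation run record. Names are illustrative,
# not the framework's real API.
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    prompt_id: str
    attack: str          # e.g. "PAIR" or "Crescendo"
    target_model: str    # e.g. "Llama-3.3-70B"
    jailbroken: bool     # judge verdict on the final transcript
    queries_used: int    # attacker turns spent before success or giving up

@dataclass
class EvalRun:
    attack: str
    target_model: str
    records: list[RunRecord] = field(default_factory=list)

    def add(self, prompt_id: str, jailbroken: bool, queries_used: int) -> None:
        # Append one judged prompt outcome to this run.
        self.records.append(
            RunRecord(prompt_id, self.attack, self.target_model,
                      jailbroken, queries_used))

run = EvalRun(attack="PAIR", target_model="Llama-3.3-70B")
run.add("p-001", jailbroken=True, queries_used=3)
run.add("p-002", jailbroken=False, queries_used=10)
print(len(run.records))  # 2
```

Keeping per-prompt records like this, rather than only aggregate numbers, is what lets the same run be re-scored under different metrics or judges.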

| Section | What you'll find |
| --- | --- |
| 🗺️ Threat Model | OWASP Agentic AI Top-10 taxonomy, full attack-surface analysis |
| ⚔️ Attacks | PAIR, Crescendo, Prompt Fusion, and Hybrid method documentation |
| 🛡️ Defenses | JBShield, Gradient Cuff, Progent, StepShield — how each works |
| 📊 Evaluation | Benchmark methodology, metrics (MIR/TIR/DBR/QTJ), leaderboard |
| 🌐 Providers | Cloud, local, and HPC provider setup |
| ⚡ Getting Started | Environment setup, install, and first run |
| 🏗️ Architecture | System wiring, execution flows, threat-defense model |
| 🚀 Deployment | GitHub Pages, Hugging Face Space, experiment scale-out |

## Mini-Benchmark Results at a Glance

*Strict PAIR attack · no defenses · 4-model core set · consistent Llama-3.3-70B judge*

### MIR by Model

| Model | MIR | QTJ |
| --- | --- | --- |
| Llama-3.3-70B | 83.7% | ~3.0 |
| DeepSeek-R1-70B | 83.2% | ~3.0 |
| DeepSeek-R1-14B | 75.4% | ~2.6 |
| DeepSeek-V3.2 | 66.0% | ~2.2 |

See the Evaluation section for the full methodology and per-category breakdown.

## Responsible Use

This framework is designed for security research and safety evaluation in controlled environments. Isolate access to target models and tools so that testing cannot cause actual harm. We encourage responsible disclosure of any vulnerabilities discovered with these tools.