
# Agentic Safety Evaluation Framework

> **Research-first documentation:** this site foregrounds the threat model, attack taxonomy, defense mechanisms, and benchmark results before operational setup. If you just want to run an experiment, jump to Quickstart.

## What This Framework Evaluates

Agentic LLMs are qualitatively different from single-turn chat models. They plan across many steps, call external tools, browse the web, execute code, and interact with other agents. A request that a chat model safely refuses in one turn may succeed after three carefully crafted turns with tool-use context.

This framework provides a repeatable evaluation harness that tests jailbreak attacks across the full agentic pipeline — from initial prompt to tool execution.
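To make the shape of such a harness concrete, here is a minimal sketch of a single evaluation run: one attack method paired with one target model, with a per-prompt judge verdict recorded for later aggregation. All class and field names below are illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch of an evaluation run record. Names are illustrative,
# not the framework's real API.
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    prompt_id: str
    attack: str          # e.g. "PAIR" or "Crescendo"
    target_model: str    # e.g. "Llama-3.3-70B"
    jailbroken: bool     # judge verdict on the final transcript
    queries_used: int    # attacker turns spent before success or giving up

@dataclass
class EvalRun:
    attack: str
    target_model: str
    records: list[RunRecord] = field(default_factory=list)

    def add(self, prompt_id: str, jailbroken: bool, queries_used: int) -> None:
        # Append one judged prompt outcome to this run.
        self.records.append(
            RunRecord(prompt_id, self.attack, self.target_model,
                      jailbroken, queries_used))

run = EvalRun(attack="PAIR", target_model="Llama-3.3-70B")
run.add("p-001", jailbroken=True, queries_used=3)
run.add("p-002", jailbroken=False, queries_used=10)
print(len(run.records))  # 2
```

Keeping per-prompt records like this, rather than only aggregate numbers, is what lets the same run be re-scored under different metrics or judges.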

| Section | What you'll find |
| --- | --- |
| 🗺️ Threat Model | OWASP Agentic AI Top-10 taxonomy, full attack-surface analysis |
| ⚔️ Attacks | PAIR, Crescendo, Prompt Fusion, and Hybrid method documentation |
| 🛡️ Defenses | JBShield, Gradient Cuff, Progent, StepShield — how each works |
| 📊 Evaluation | Benchmark methodology, metrics (MIR/TIR/DBR/QTJ), leaderboard |
| 🌐 Providers | Cloud, local, and HPC provider setup |
| ⚡ Getting Started | Environment setup, install, and first run |
| 🏗️ Architecture | System wiring, execution flows, threat-defense model |
| 🚀 Deployment | GitHub Pages, Hugging Face Space, experiment scale-out |

## Mini-Benchmark Results at a Glance

*Strict PAIR attack · no defenses · 4-model core set · consistent Llama-3.3-70B judge*

### MIR by Model

| Model | MIR | QTJ |
| --- | --- | --- |
| Llama-3.3-70B | 83.7% | ~3.0 |
| DeepSeek-R1-70B | 83.2% | ~3.0 |
| DeepSeek-R1-14B | 75.4% | ~2.6 |
| DeepSeek-V3.2 | 66.0% | ~2.2 |

See the Evaluation section for the full methodology and per-category breakdown.

## Responsible Use

This framework is designed for security research and safety evaluation in controlled environments. Isolate access to target models and tools so that testing cannot cause actual harm. We encourage responsible disclosure of any vulnerabilities discovered with these tools.