# Project Overview
## Repository goals
This repository is a structured evaluation framework for agentic jailbreaks and defenses. It is designed to:
- generate and execute jailbreak-style attack scenarios
- test defense layers across prompt, response, and tool-action paths
- log and export reproducible metrics for analysis
- operate with both local Hugging Face models and API-hosted backends (a minimal backend sketch follows this list)
- provide a deployable API and web frontend for hosted evaluation
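As a rough illustration of the local/hosted split, here is a minimal Python sketch. `LocalHFBackend`, `OpenAICompatibleBackend`, and the `OPENAI_API_KEY` variable are illustrative assumptions, not this repository's actual interfaces; only the `transformers` pipeline call and the OpenAI-style chat-completions request follow real, public APIs.

```python
# Sketch of a dual-backend interface: one local Hugging Face path, one
# OpenAI-compatible HTTP path. Class names here are hypothetical.
import os
import requests                     # API-hosted path
from transformers import pipeline   # local path

class LocalHFBackend:
    def __init__(self, model_name: str):
        # Load the model locally via transformers.
        self._pipe = pipeline("text-generation", model=model_name)

    def generate(self, prompt: str) -> str:
        return self._pipe(prompt, max_new_tokens=64)[0]["generated_text"]

class OpenAICompatibleBackend:
    def __init__(self, base_url: str, model: str):
        self.base_url, self.model = base_url, model

    def generate(self, prompt: str) -> str:
        # Standard OpenAI-style chat completions request.
        resp = requests.post(
            f"{self.base_url}/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={"model": self.model,
                  "messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```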
## Key capabilities
- Multi-mode execution: `attack`, `baseline`, and `agentic`
- Plug-and-play attack strategies: PAIR, GCG, Crescendo, baseline, prompt fusion, and hybrid variants
- Defense modules: JBShield, Gradient Cuff, Progent, StepShield, plus registry-based activation (see the sketch after this list)
- Sandbox tools: `file_io`, `code_exec`, `web_browse`, `network`
- Metrics pipeline: MIR, TIR, DBR, QTJ, plus detailed per-run and per-goal logs
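To make "registry-based activation" concrete, here is a self-contained Python sketch of the pattern. `DEFENSE_REGISTRY`, `register_defense`, and the `Defense` interface are hypothetical names, and the JBShield-like check is a stand-in heuristic, not the real defense logic.

```python
# Registry pattern: defenses self-register under a string key so a
# config-driven runner can activate them by name.
from typing import Callable, Dict

DEFENSE_REGISTRY: Dict[str, Callable[[], "Defense"]] = {}

def register_defense(name: str):
    """Decorator that records a defense class under a string key."""
    def wrap(cls):
        DEFENSE_REGISTRY[name] = cls
        return cls
    return wrap

class Defense:
    def check_prompt(self, prompt: str) -> bool:
        """Return True if the prompt should be allowed through."""
        raise NotImplementedError

@register_defense("jbshield")
class JBShieldLike(Defense):
    def check_prompt(self, prompt: str) -> bool:
        # Placeholder heuristic standing in for the actual JBShield method.
        return "ignore previous instructions" not in prompt.lower()

# Activation by name, as a config-driven runner might do it.
defense = DEFENSE_REGISTRY["jbshield"]()
print(defense.check_prompt("Summarize this PDF."))  # True
```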
## High-level package layout
- `run.py`: CLI orchestrator and experiment entrypoint
- `runner/`: config loading, model build, sandbox integration, attack/defense wiring, metrics collection
- `attacks/`: attack implementations and runner logic
- `defenses/`: defense implementations and registry
- `tools/`: sandbox tool adapters and isolation helpers
- `metrics/`: metrics definitions, aggregation, and export
- `configs/`: reusable YAML scenario presets and defaults
- `data/`: evaluation goals, scenarios, and generation scripts
- `server/`: FastAPI backend, job API, and static asset serving
- `frontend/`: web UI source and built distribution
- `scripts/`: deploy helpers and batch launcher utilities
- `.github/workflows/`: CI and docs deployment automation
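To show how these packages might fit together at run time, here is a hypothetical wiring sketch. Every import path and function name below is an assumption chosen to mirror the layout above, so this only runs inside a repository with exactly this structure; treat it as a mental model, not the actual entrypoint code.

```python
# Hypothetical end-to-end flow: config -> model -> attack/defense -> metrics.
# All names below are assumptions for illustration only.
from runner.config import load_config    # configs/: YAML presets
from runner.models import build_model    # local HF or API-hosted backend
from attacks import get_attack           # attacks/: strategy lookup
from defenses import get_defense         # defenses/: registry lookup
from metrics import collect_metrics      # metrics/: aggregation and export

cfg = load_config("configs/sample.yaml")
model = build_model(cfg.model)
attack = get_attack(cfg.attack)
defense = get_defense(cfg.defense)
results = attack.run(model, defense, goals=cfg.goals)
collect_metrics(results, out_dir=cfg.output_dir)
```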
## Recommended first steps
- Create a Python virtual environment.
- Install the package.
- Configure API keys for your chosen backend.
- Run a sample experiment (see the quickstart sketch after this list).
- Preview the docs locally with MkDocs (`mkdocs serve`).
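A minimal quickstart sketch for the key-configuration and sample-run steps, assuming an OpenAI-style backend. The `OPENAI_API_KEY` variable, the `--config` flag, and the `configs/sample.yaml` path are illustrative guesses, not documented options; only `run.py` as the entrypoint comes from the layout above.

```python
# Hypothetical quickstart: set an API key, then launch run.py.
import os
import subprocess
import sys

# Configure the API key for an API-hosted backend (variable name assumed).
os.environ.setdefault("OPENAI_API_KEY", "sk-...")

# Invoke the CLI orchestrator; the flag and config path are assumptions.
subprocess.run(
    [sys.executable, "run.py", "--config", "configs/sample.yaml"],
    check=True,
)
```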