Running Experiments¶
Typical command¶
source .venv/bin/activate
python run.py --config configs/eval_genai_pair_localjudge_100.yaml --verbose
Common CLI experiment patterns¶
Run a simple baseline evaluation:
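A minimal sketch, reusing the eval_qwen_baseline.yaml config from the smoke test below (the output path is illustrative):
python run.py --config configs/eval_qwen_baseline.yaml --output-dir results/baseline --verbose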
Run a targeted attack experiment:
python run.py \
--config configs/eval_qwen_pair_attack.yaml \
--mode attack \
--goals data/agentic_scenarios_10_mixed.json \
--use-sandbox \
--use-defenses jbshield gradient_cuff \
--attack-plan pair crescendo baseline \
--output-dir results/demo \
--verbose
Run a partial dataset subset:
python run.py --config configs/baseline.yaml --goals data/agentic_scenarios_smoke5.json --goal-indices 0,2,5 --output-dir results/smoke
Run agentic mode with sandbox tools:
python run.py --config configs/eval_qwen_pair_attack.yaml --mode agentic --use-sandbox --output-dir results/agentic
Output artifacts¶
The configured output_dir normally contains:
- *.log: run logs
- results_*.csv: record tables
- results_*.json: aggregated summary files
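A quick way to inspect a finished run; the results/demo directory is illustrative and matches the attack example above:
ls results/demo
head -n 3 results/demo/results_*.csv   # per-record rows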
CLI testing¶
Run the repository tests:
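Assuming the suite is pytest-based (not stated on this page):
pytest -q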
Run a CLI smoke test:
python run.py --config configs/eval_qwen_baseline.yaml --goals data/agentic_scenarios_smoke5.json --output-dir results/smoke --verbose
Metrics and troubleshooting¶
- MIR, TIR, DBR, QTJ: primary evaluation metrics
- If a model backend fails, verify the provider key and available token limits
- Slow experiments: reduce attacks[*].params.n_iterations or the sandbox max_steps (see the lookup example below)
- If a goal yields only an error response, the run may skip that record during metric aggregation
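Before editing, a grep sketch to locate those keys; the config filename is taken from the examples above:
grep -n "n_iterations" configs/eval_qwen_pair_attack.yaml
grep -n "max_steps" configs/eval_qwen_pair_attack.yaml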