
Running Experiments

Typical command

source .venv/bin/activate
python run.py --config configs/eval_genai_pair_localjudge_100.yaml --verbose
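Before starting a long run, it can help to confirm that the chosen YAML config parses and to glance at its top-level keys. A minimal sketch using PyYAML; the script name and printed summary are only illustrative, and nothing about the config schema is assumed beyond it being a YAML mapping:

# check_config.py -- illustrative sanity check that a config file parses
import sys
import yaml  # PyYAML

path = sys.argv[1] if len(sys.argv) > 1 else "configs/eval_genai_pair_localjudge_100.yaml"

with open(path) as f:
    cfg = yaml.safe_load(f)

# Print the top-level keys so typos in the config stand out before a long run.
print(f"{path}: top-level keys -> {sorted(cfg.keys())}")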

Common CLI experiment patterns

Run a simple baseline evaluation:

python run.py --config configs/eval_qwen_baseline.yaml --verbose

Run a targeted attack experiment:

python run.py \
  --config configs/eval_qwen_pair_attack.yaml \
  --mode attack \
  --goals data/agentic_scenarios_10_mixed.json \
  --use-sandbox \
  --use-defenses jbshield gradient_cuff \
  --attack-plan pair crescendo baseline \
  --output-dir results/demo \
  --verbose

Run a subset of the goal dataset by index:

python run.py --config configs/baseline.yaml --goals data/agentic_scenarios_smoke5.json --goal-indices 0,2,5 --output-dir results/smoke
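To see which indices map to which goals before passing --goal-indices, a small sketch like the following can help. It assumes the goals file is a JSON array of goal entries; the "goal" summary field used here is a guess, so adjust it to whatever key the file actually uses:

# list_goals.py -- illustrative; assumes the goals file is a JSON array
import json
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "data/agentic_scenarios_smoke5.json"

with open(path) as f:
    goals = json.load(f)

for i, goal in enumerate(goals):
    # The summary field name is a guess; replace it with the real key.
    summary = goal.get("goal", goal) if isinstance(goal, dict) else goal
    print(f"{i}: {str(summary)[:80]}")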

Run agentic mode with sandbox tools:

python run.py --config configs/eval_qwen_pair_attack.yaml --mode agentic --use-sandbox --output-dir results/agentic

Output artifacts

The configured output_dir (or the --output-dir override) normally contains the following; a loading sketch follows the list:

  • *.log: run logs
  • results_*.csv: per-record result tables
  • results_*.json: aggregated summary files
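For a quick look at a finished run, something along these lines works against the artifacts above. It relies only on the filename patterns listed; pandas is an assumed dependency, the results/demo path is just an example, and the column names and JSON schema are whatever the run actually produced:

# inspect_results.py -- illustrative skim of an output directory
import glob
import json

import pandas as pd

out_dir = "results/demo"  # whatever output_dir / --output-dir was set to

# Per-record tables
for csv_path in sorted(glob.glob(f"{out_dir}/results_*.csv")):
    df = pd.read_csv(csv_path)
    print(csv_path, df.shape)
    print(df.head())

# Aggregated summaries
for json_path in sorted(glob.glob(f"{out_dir}/results_*.json")):
    with open(json_path) as f:
        summary = json.load(f)
    print(json_path, list(summary) if isinstance(summary, dict) else type(summary).__name__)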

CLI testing

Run the repository tests:

pytest -q tests/

Run a CLI smoke test:

python run.py --config configs/eval_qwen_baseline.yaml --goals data/agentic_scenarios_smoke5.json --output-dir results/smoke --verbose
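If you want the smoke run wired into pytest, a sketch like the one below shells out to run.py with the same config and goals file as the command above and checks that the process exits cleanly and leaves artifacts behind. It assumes the baseline config can run end to end from the repo root with whatever backend it points at, so it may be slow or require provider credentials; the artifact checks mirror the Output artifacts section:

# tests/test_cli_smoke.py -- illustrative smoke test; paths taken from the command above
import subprocess
import sys


def test_cli_smoke(tmp_path):
    out_dir = tmp_path / "smoke"
    cmd = [
        sys.executable, "run.py",
        "--config", "configs/eval_qwen_baseline.yaml",
        "--goals", "data/agentic_scenarios_smoke5.json",
        "--output-dir", str(out_dir),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    assert result.returncode == 0, result.stderr
    # Expect at least one log and one results file in the output directory.
    assert list(out_dir.glob("*.log"))
    assert list(out_dir.glob("results_*"))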

Metrics and troubleshooting

  • MIR, TIR, DBR, QTJ: primary evaluation metrics
  • If a model backend fails, verify the provider API key and the available token limits
  • Slow experiments: reduce attacks[*].params.n_iterations or sandbox max_steps (see the sketch after this list)
  • If a goal yields only an error response, the run may skip that record during metric aggregation
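One way to apply the speed reductions above without editing the original config is to write a trimmed copy. This sketch assumes the YAML layout implied by the field names in the list (an attacks list whose entries hold params.n_iterations, and a sandbox section with max_steps); the file names and the caps of 2 iterations and 5 steps are arbitrary examples, so adapt the key paths and values to the actual config:

# make_fast_config.py -- illustrative; key layout is assumed from the field names above
import yaml  # PyYAML

src, dst = "configs/eval_qwen_pair_attack.yaml", "configs/eval_qwen_pair_attack_fast.yaml"

with open(src) as f:
    cfg = yaml.safe_load(f)

# Cap attack iterations, if present.
for attack in cfg.get("attacks", []):
    params = attack.setdefault("params", {})
    if "n_iterations" in params:
        params["n_iterations"] = min(params["n_iterations"], 2)

# Cap sandbox steps, if present.
if "sandbox" in cfg:
    cfg["sandbox"]["max_steps"] = min(cfg["sandbox"].get("max_steps", 5), 5)

with open(dst, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)

print(f"wrote {dst}")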