Reproducibility¶
Regenerating Charts and Benchmark Data¶
All charts embedded in these docs and in the README are generated from versioned result artifacts. To reproduce:
# Activate the project venv
source .venv/bin/activate
# Run the chart pipeline
python scripts/gen_benchmark_charts.py \
--results-dir results/agentic_experiments_v2_500 \
--out-dir docs/assets/charts
Output:
docs/assets/charts/
├── MIR_by_model.png
├── MIR_by_category.png
├── tool_quality.png
├── query_efficiency.png
├── query_distribution.png
└── benchmark_data.json ← normalised chart data
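As a quick sanity check after a run, the snippet below (a sketch that uses only the filenames listed above) confirms every artifact was written and that benchmark_data.json parses:

# Sketch: verify the chart pipeline wrote every expected artifact.
# Filenames are taken from the output listing above; adjust if the script changes.
import json
from pathlib import Path

out_dir = Path("docs/assets/charts")
expected = [
    "MIR_by_model.png",
    "MIR_by_category.png",
    "tool_quality.png",
    "query_efficiency.png",
    "query_distribution.png",
    "benchmark_data.json",
]

missing = [name for name in expected if not (out_dir / name).exists()]
if missing:
    raise SystemExit(f"Chart pipeline output incomplete, missing: {missing}")

# benchmark_data.json should at least be valid JSON
data = json.loads((out_dir / "benchmark_data.json").read_text())
print("All artifacts present; benchmark_data.json keys:", sorted(data))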
Benchmark Filter Rules¶
The script applies these filters programmatically:
# From scripts/gen_benchmark_charts.py
BENCHMARK_ATTACK = "pair"
CORE_MODELS = {
    "Llama-3.3-70B": "llama3.3:70b",
    "DeepSeek-R1-70B": "deepseek-r1:70b",
    "DeepSeek-R1-14B": "deepseek-r1:14b",
    "DeepSeek-V3.2": "deepseek-v3.2",
}
# defense_name must be empty
# dedup: first occurrence per (goal, model) pair
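The sketch below illustrates how those rules could be applied to a flat list of result records. It is not the canonical implementation, and the record field names (attack, model, defense_name, goal) are assumptions; the actual logic lives in scripts/gen_benchmark_charts.py.

# Sketch of the filter rules above -- NOT the canonical code.
# Field names (attack, model, defense_name, goal) are assumed here.
def filter_benchmark_records(records, benchmark_attack="pair", core_model_ids=None):
    core_model_ids = core_model_ids or {
        "llama3.3:70b", "deepseek-r1:70b", "deepseek-r1:14b", "deepseek-v3.2",
    }
    seen = set()   # dedup key: (goal, model)
    kept = []
    for rec in records:
        if rec.get("attack") != benchmark_attack:
            continue                                 # benchmark attack only
        if rec.get("model") not in core_model_ids:
            continue                                 # core models only
        if rec.get("defense_name"):
            continue                                 # defense_name must be empty
        key = (rec.get("goal"), rec.get("model"))
        if key in seen:
            continue                                 # keep first occurrence only
        seen.add(key)
        kept.append(rec)
    return kept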
Data Source¶
Result JSON files live in results/agentic_experiments_v2_500/. Each sub-folder corresponds to one experiment run and contains one JSON file with summary, by_category, and records keys.
Older format files (plain list schema, no top-level summary key) are also handled by the pipeline.
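As an illustration, a loader that tolerates both schemas might look like the sketch below. The one-JSON-file-per-subfolder glob and the assumption that new-format records sit under the top-level records key follow the description above; the chart script remains authoritative.

# Sketch: collect raw records from both result schemas described above.
# Assumes one JSON file per run sub-folder.
import json
from pathlib import Path

def load_all_records(results_dir="results/agentic_experiments_v2_500"):
    records = []
    for path in Path(results_dir).glob("*/*.json"):
        payload = json.loads(path.read_text())
        if isinstance(payload, dict) and "summary" in payload:
            records.extend(payload.get("records", []))   # current format
        elif isinstance(payload, list):
            records.extend(payload)                      # older plain-list format
    return records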
Verifying Metric Values¶
To cross-check a specific model's MIR:
python3 -c "
import json
from pathlib import Path
# Load benchmark_data.json (normalised output)
data = json.loads(Path('docs/assets/charts/benchmark_data.json').read_text())
print('MIR by model:', data['MIR_by_model'])
print('Per-model N:', data['benchmark']['per_model_n'])
"
Adding New Runs¶
- Place the new result directory under results/agentic_experiments_v2_500/ (or update --results-dir).
- Re-run scripts/gen_benchmark_charts.py.
- Commit the updated chart PNGs and benchmark_data.json.
Reproducibility policy
The benchmark filter rules in scripts/gen_benchmark_charts.py are the canonical source of truth. If you change filtering logic, commit the updated script alongside the new charts so the change is traceable.