Results & Leaderboard

PAIR Mini-Benchmark Leaderboard

Strict PAIR · No defenses · 999 deduplicated records · Consistent judge

Rank                  Model            MIR    Avg QTJ
1 (most resistant)    DeepSeek-V3.2    66.0%  ~2.2
2                     DeepSeek-R1-14B  75.4%  ~2.6
3                     DeepSeek-R1-70B  83.2%  ~3.0
4 (most susceptible)  Llama-3.3-70B    83.7%  ~3.0

Model identifiers

Model names like DeepSeek-V3.2 refer to internal benchmark checkpoints or specific API tags used during evaluation. Tool quality metrics are available per-category in the charts below.

Charts

MIR by Model

MIR by OWASP AAI Category

Tool-Call Quality

Query Efficiency vs MIR

A lower QTJ among successful attacks means the model was broken quickly; this is worse, not better. DeepSeek-V3.2 is comparatively resistant (lowest MIR on the leaderboard).

Query Count Distribution

Browsing Raw Results

All raw result JSON files are mirrored to the Hugging Face dataset repository:

Mo-alaa/agentic-safety-results
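The mirrored JSON files can be pulled down with the `huggingface_hub` client. A minimal sketch, assuming the raw result files are the `.json` files in the dataset repo (the exact filenames are not specified here):

```python
def is_result_json(path: str) -> bool:
    """Keep only JSON files; assumed to be the raw result records."""
    return path.endswith(".json")

def download_results(repo_id: str = "Mo-alaa/agentic-safety-results") -> list[str]:
    """Download every result JSON from the dataset repo; returns local paths.

    Requires `pip install huggingface_hub` and network access.
    """
    from huggingface_hub import hf_hub_download, list_repo_files

    files = [f for f in list_repo_files(repo_id, repo_type="dataset") if is_result_json(f)]
    return [hf_hub_download(repo_id, f, repo_type="dataset") for f in files]
```

The import is kept inside the function so the helper can be used without the library installed.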

The live Space also exposes results via the /api/results endpoint and provides a browsable frontend.

Mo-alaa/agentic-safety-eval Space
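The `/api/results` endpoint can be queried directly. A minimal sketch using only the standard library; the base URL is an assumption derived from the usual Hugging Face Space hostname pattern (`owner-name.hf.space`) and may differ for this Space:

```python
import json
import urllib.request

def results_url(space: str = "Mo-alaa/agentic-safety-eval") -> str:
    """Build the assumed /api/results URL for a Hugging Face Space."""
    owner, name = space.split("/")
    host = f"{owner}-{name}".lower().replace("_", "-")
    return f"https://{host}.hf.space/api/results"

def fetch_results() -> object:
    """Fetch and parse the results JSON (network call)."""
    with urllib.request.urlopen(results_url(), timeout=30) as resp:
        return json.load(resp)
```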

Interpreting the Leaderboard

  • Low MIR is better: it means the model resisted more attacks.
  • Low QTJ is worse: among the attacks that did succeed, the model was broken within only a few queries.
  • A model with low MIR but also low QTJ may have a sharp threshold: mostly resistant, but broken quickly once a good prompt is found.
  • The ideal model has low MIR and high QTJ: hard to break at all, and slow to break even when an attack does succeed.