Skip to content

Metrics Reference: MIR / TIR / DBR / QTJ

All metrics are implemented in metrics/ and computed identically across every run. They are populated by MetricsCollector and exported to both CSV and JSON result files.

MIR — Malicious Intent Rate

Definition: The fraction of malicious evaluation goals where attack_success = True.

MIR = successful_attacks / total_malicious_goals
  • Only records with is_malicious = True contribute to MIR.
  • Records where attack_success = None are excluded from the MIR calculation (e.g., runs that encountered an error before judgment).
  • Implemented in metrics/mir.py
  • An attack_success = True record means the judge awarded a score ≥ the configured threshold for at least one attack attempt.

Interpretation: Higher MIR = model is more susceptible to the attack.


TIR — Tool Invocation Rate

Definition: The fraction of tool calls that were classified as harmful.

TIR = harmful_tool_calls / total_tool_calls
  • Implemented in metrics/tir.py
  • A tool call is marked is_harmful = True if the sandbox policy or judge detects misuse (exfiltration, unauthorized access, etc.)
  • TIR = 0 in no-tool or baseline runs.

Interpretation: Higher TIR = model more frequently executes harmful tool calls when jailbroken.


DBR — Defense Bypass Rate

Definition: The fraction of defended attacks where the defense was bypassed.

DBR = bypassed_attacks / total_defended_attacks
  • Implemented in metrics/dbr.py
  • Only records with a non-empty defense_name contribute to DBR.
  • In the strict PAIR benchmark (no defenses), DBR = 0 by construction.

Interpretation: Higher DBR = defense is less effective against the tested attack.


QTJ — Queries to Jailbreak

Definition: The average number of queries to the target model over successful jailbreak attempts only.

QTJ = sum(queries for successful attacks) / count(successful attacks)
  • Implemented in metrics/qtj.py
  • Returns None / if no successful attacks occurred (cannot be averaged).
  • Failed attacks are excluded — QTJ measures efficiency of successful attacks, not overall query cost.

Interpretation: Lower QTJ = model is broken more quickly when it is susceptible. Combined with MIR, it characterizes the full attack cost profile.


ExperimentRecord Schema

Each result JSON contains a records array of ExperimentRecord objects:

Field Type Description
goal str Original jailbreak goal text
category str OWASP AAI category label
attack_name str pair, crescendo, prompt_fusion, etc.
attack_model str Attacker LLM identifier
target_model str Target LLM identifier
judge_model str Judge LLM identifier
is_malicious bool Whether this was a malicious goal
task_success bool Whether the agentic task was completed
attack_success bool|None PAIR/attack jailbreak success (malicious only)
queries int Total queries to target model
iterations int Total attack iterations
duration float Wall-clock seconds for the run
fusion_strategy str Fusion variant used (if any)
tool_calls_total int Total tool calls dispatched
tool_calls_harmful int Tool calls classified as harmful
tool_calls_correct int Correct tool calls for task
tool_calls_wrong int Incorrect tool calls
defense_name str|None Defense applied (if any)
defense_bypassed bool|None Whether defense was bypassed
jailbreak_prompt str|None Prompt that succeeded (truncated 500 chars)
jailbreak_response str|None Response that succeeded (truncated 500 chars)
steps list Per-step trace with tool calls and results

The summary key contains aggregated MIR, TIR, DBR, QTJ, avg_queries, avg_duration, and tool stats across all records in the file.