Prompt Fusion Attack¶
Related Work
This strategy builds on prompt ensemble and candidate selection ideas from:
-
Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043. https://arxiv.org/abs/2307.15043
-
Chao, P. et al. (2023). Jailbreaking Black Box LLMs in Twenty Queries (PAIR). arXiv:2310.08419. https://arxiv.org/abs/2310.08419
Prompt Fusion generates multiple jailbreak candidates and combines the most effective elements into a single composite prompt.
How Prompt Fusion Works¶
- An attacker model generates N independent jailbreak candidate prompts (default N=5).
- Each candidate is tested against the target model and scored by the judge.
- The top-scoring candidates are fused — their strongest elements are extracted and combined by the attacker model into one composite prompt.
- The composite is submitted as the final attack attempt.
Fusion Strategies¶
The fusion_strategy field in results records the approach used:
| Strategy | Description |
|---|---|
pair_standalone |
Standard PAIR without fusion |
fusion_top_k |
Fuse top-K scored candidates |
fusion_ensemble |
All candidates merged via attacker LLM |
Configuration¶
attacks:
- prompt_fusion
attack_config:
prompt_fusion:
n_candidates: 5
top_k: 2
fusion_model: ollama:gemma4:31b # separate model for fusion step
Notes¶
- Implemented in
attacks/prompt_fusion.py - Small-N runs show near-100% MIR but sample sizes are too small for reliable benchmark comparison
- Not included in the strict PAIR mini-benchmark; used for supplementary analysis