JBShield¶
Original Paper¶
Zhang, X., He, Y., Dong, X., & Wang, X. (2024). JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation. arXiv:2412.12549. https://arxiv.org/abs/2412.12549
JBShield is a prompt-layer defense that detects jailbreak attempts by measuring the score divergence between the original and mutated versions of a candidate prompt.
Mechanism¶
- The incoming prompt is mutated N times (paraphrase, synonym replacement, minor structural changes).
- The target model scores each mutation for intent alignment.
- If the divergence between the original prompt's score and the mutation scores exceeds a threshold, the prompt is flagged as adversarial.
Intuition: Legitimate prompts remain coherent under paraphrasing. Adversarially crafted prompts often lose their effect when slightly rephrased.
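The steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: `mutate` is a toy adjacent-word swap standing in for real paraphrase/synonym mutations, and `score_fn`, `n_mutations`, and `threshold` are hypothetical names.

```python
import random
import statistics

def mutate(prompt: str, rng: random.Random) -> str:
    # Toy mutation: swap one pair of adjacent words. A real pipeline would
    # use paraphrasing, synonym replacement, or structural edits instead.
    words = prompt.split()
    i = rng.randrange(max(len(words) - 1, 1))
    j = min(i + 1, len(words) - 1)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def is_adversarial(prompt, score_fn, n_mutations=8, threshold=0.15, seed=0):
    """Flag a prompt whose intent-alignment score diverges under mutation.

    score_fn: callable mapping a prompt string to a scalar alignment score
    (in practice, a query to the target model).
    """
    rng = random.Random(seed)
    base = score_fn(prompt)
    scores = [score_fn(mutate(prompt, rng)) for _ in range(n_mutations)]
    # Divergence = mean absolute deviation of mutated scores from the base.
    divergence = statistics.fmean(abs(s - base) for s in scores)
    return divergence > threshold
```

A prompt whose score is stable under paraphrasing passes; a prompt whose effect collapses when slightly rephrased trips the threshold.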
Configuration¶
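A minimal sketch of the tunable parameters implied by the mechanism above. All names here are hypothetical illustrations, not the tool's actual API.

```python
# Hypothetical configuration for the mutation-divergence check.
JBSHIELD_CONFIG = {
    "n_mutations": 8,                # N: variants generated per incoming prompt
    "mutation_ops": [                # which mutation strategies to apply
        "paraphrase",
        "synonym_replacement",
        "structural_edit",
    ],
    "divergence_threshold": 0.15,    # flag the prompt above this divergence
    "seed": 0,                       # fix the RNG for reproducible mutations
}
```

Raising `n_mutations` improves detection stability at the cost of extra forward passes; the threshold trades false positives against missed attacks.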
Limitations¶
- Computationally expensive: requires N+1 forward passes per prompt.
- Not robust to semantically stable attacks: PAIR-crafted prompts that retain their effect under paraphrasing may still bypass the check.
→ Back to Defenses Overview