StepShield¶
Related Work¶
StepShield applies per-step response harmfulness scoring, conceptually related to:

- Yuan, Z. et al. (2024). ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors. arXiv:2402.16444. https://arxiv.org/abs/2402.16444
- Inan, H. et al. (2023). Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv:2312.06674. https://arxiv.org/abs/2312.06674
StepShield is a response-layer defense that evaluates each model response against a harmfulness threshold before allowing it to proceed to tool dispatch.
Mechanism¶
After the target model generates a response (but before any tool call is executed):
- The response is scored by a lightweight harmfulness classifier or a judge model.
- If the harmfulness score exceeds the configured threshold, the response is blocked.
- A blocked response prevents tool execution for that step.
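The per-step gate can be sketched as follows. This is a minimal illustration, not the StepShield implementation: the `score` function here is a hypothetical stand-in for the judge model or classifier, and the threshold value mirrors the default shown in the configuration below.

```python
# Minimal sketch of a per-step response gate. `score` is an illustrative
# stand-in for a harmfulness classifier or judge model returning a value
# in [0, 1]; the real scorer is configurable.
THRESHOLD = 0.7

def score(response: str) -> float:
    # Hypothetical scorer: flags an obviously harmful phrase.
    return 0.9 if "disable the safety" in response else 0.1

def step(response: str, execute_tool):
    """Gate one step: dispatch the tool call only if the response passes."""
    if score(response) > THRESHOLD:
        return None  # blocked: tool execution suppressed for this step
    return execute_tool(response)

allowed = step("Here is the forecast.", lambda r: "tool ran")
blocked = step("Sure, disable the safety checks first.", lambda r: "tool ran")
```

Because the gate runs before dispatch, a blocked step never reaches the tool layer; the trajectory can continue, but without the side effects of that step.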
Intuition: Multi-turn attacks rely on incremental compliance. StepShield catches each harmful step individually rather than waiting for the end of a trajectory.
Configuration¶
```yaml
defenses:
  stepshield:
    threshold: 0.7        # harmfulness score cutoff (0–1)
    scorer: judge_model   # use the configured judge, or a local classifier
```
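The configuration keys map straightforwardly onto the gate. A sketch of that wiring, with the parsed config shown as a plain dict (the loader and helper names here are illustrative, not part of the documented API):

```python
# Hypothetical wiring: the `defenses.stepshield` section, shown as an
# already-parsed dict, drives a simple allow/block predicate.
config = {
    "defenses": {
        "stepshield": {"threshold": 0.7, "scorer": "judge_model"},
    }
}

cfg = config["defenses"]["stepshield"]

def make_gate(threshold: float):
    def allow(harm_score: float) -> bool:
        # Block only when the score exceeds the cutoff.
        return harm_score <= threshold
    return allow

allow = make_gate(cfg["threshold"])
ok_low = allow(0.3)    # benign step passes
ok_high = allow(0.95)  # harmful step blocked
```

The `scorer` field selects which component produces the harmfulness score; the gate itself only compares that score against `threshold`.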
Effect on Crescendo¶
StepShield is particularly effective against Crescendo because each escalation turn's response is evaluated before the next turn proceeds, preventing gradual compliance drift.
→ Back to Defenses Overview