
Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/5axtpujejwcx

Abstract

Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor ("Sure") with a refusal anchor ("Sorry"), tightening the decision boundary and lowering false positives. In the mitigation stage, if a prompt is flagged, GCD pre-injects one or two refusal tokens ("Sorry, I can't ...") before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 20% vs. the strongest decoding-only baseline, adds 15-20 ms of latency on average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8×7B, and Qwen-2-7B, and requires only 20 template prompts. GCD is a lightweight, scalable safety layer for real-time LLM deployment.
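The two stages the abstract describes can be sketched in miniature. The code below is an illustrative outline only, not the authors' implementation: the gradient signatures are stand-in vectors, and `cosine`, `is_unsafe`, `guarded_decode`, and the `margin` parameter are hypothetical names chosen for this sketch. It shows the dual-anchor decision (compare a prompt's gradient signature against both a "Sure" acceptance anchor and a "Sorry" refusal anchor) and the mitigation step (pre-inject refusal tokens before decoding resumes, so the first emitted tokens are safe under any sampling strategy).

```python
import math

def cosine(u, v):
    # Cosine similarity between two gradient-signature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def is_unsafe(grad, accept_anchor, refuse_anchor, margin=0.1):
    # Dual-anchor decision: flag the prompt when its gradient signature
    # is closer to the refusal anchor ("Sorry") than to the acceptance
    # anchor ("Sure") by at least `margin` (an assumed hyperparameter).
    return cosine(grad, refuse_anchor) - cosine(grad, accept_anchor) > margin

def guarded_decode(prompt_grad, accept_anchor, refuse_anchor, decode_fn):
    # Mitigation stage: for flagged prompts, pre-inject refusal tokens
    # before handing control back to ordinary autoregressive decoding,
    # guaranteeing first-token safety regardless of sampling strategy.
    if is_unsafe(prompt_grad, accept_anchor, refuse_anchor):
        prefix = ["Sorry,", "I", "can't"]
        return prefix + decode_fn(prefix)
    return decode_fn([])

# Toy demonstration with 2-D stand-in gradient signatures.
accept = [1.0, 0.0]
refuse = [0.0, 1.0]
unsafe_grad = [0.1, 0.9]   # leans toward the refusal anchor
safe_grad = [0.9, 0.1]     # leans toward the acceptance anchor

continue_decoding = lambda prefix: ["<rest-of-response>"]
print(guarded_decode(unsafe_grad, accept, refuse, continue_decoding))
print(guarded_decode(safe_grad, accept, refuse, continue_decoding))
```

In a real deployment the gradient signatures would come from backpropagating the loss of the anchor token through the model for the given prompt, as in GradSafe; the sketch only fixes the control flow around that signal.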

Details

Paper ID
lrec2026-main-775
Pages
pp. 9884-9892
BibKey
chiniya-etal-2026-gradient
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Purva Chiniya

  • Kevin Joseph Scaria

  • Sagar Chaturvedi
