Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code
Abstract
Grammar-constrained decoding techniques used to ensure syntactic validity in code generation can be exploited as an attack surface, leading to the development of a jailbreak method called CodeSpear and a safety alignment approach named CodeShield.
Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs into generating malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.
Community
Grammar-constrained decoding techniques used to ensure syntactic validity in code generation can be exploited as an attack surface, leading to the development of a jailbreak method called CodeSpear and a safety alignment approach named CodeShield.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation (2026)
- Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection (2026)
- Poison with Style: A Practical Poisoning Attack on Code Large Language Models (2026)
- Exposing LLM Safety Gaps Through Mathematical Encoding:New Attacks and Systematic Analysis (2026)
- SecureForge: Finding and Preventing Vulnerabilities in LLM-Generated Code via Prompt Optimization (2026)
- Usability as a Weapon: Attacking the Safety of LLM-Based Code Generation via Usability Requirements (2026)
- EvoDefense: Co-Evolving Black-Box Defense with Large Language Models (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2606.11817 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper