Abstract
Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.
Key findings
One example is enough
A single flipped multiple-choice label, optimized with GRPO, collapses fairness across five bias benchmarks the model never trained on.
Bias generalizes
Stereotype-driven reasoning transfers across attributes and categories, not just the one attribute that was flipped.
Susceptibility varies
How easily a model breaks tracks its initial likelihood of producing biased outputs.
Bias onset during training
Across all four models, one-shot GRPO drives training accuracy on the single biased example to 100% while held-out fairness collapses across categories; the rate and onset differ, but the outcome does not.
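To make the mechanism behind these findings concrete, here is a minimal sketch of the group-relative advantage at the core of GRPO, applied to one multiple-choice example whose label has been flipped. It is an illustration under stated assumptions, not the paper's training code: the binary exact-match reward, the group of four sampled answers, the use of the population standard deviation, and the function names are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def flipped_label_rewards(sampled_answers, flipped_label):
    """Assumed reward: 1.0 when a sampled answer matches the flipped (biased) label."""
    return [1.0 if answer == flipped_label else 0.0 for answer in sampled_answers]


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this detail
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four answers sampled for the single training example.
# "A" stands in for the biased option that the flipped label rewards.
answers = ["B", "A", "B", "B"]
rewards = flipped_label_rewards(answers, flipped_label="A")
advantages = group_relative_advantages(rewards)

# A group in which no sample picks the biased option carries no learning signal.
no_signal = group_relative_advantages(flipped_label_rewards(["B", "B", "B", "B"], "A"))
```

In this sketch the one biased sample receives a positive advantage and the unbiased samples receive negative ones, so each update pushes probability toward the flipped label. When no sample in the group picks the biased option, every advantage is zero, which is one plausible reading of why susceptibility tracks a model's initial likelihood of producing biased outputs.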
Citation
@article{deng2026onebias,
title = {It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO},
author = {Deng, Naihao and Zhu, Yilun and Shi, Naichen and Scott, Clayton and Mihalcea, Rada},
journal = {arXiv preprint arXiv:2606.10931},
year = {2026}
}