Content warning. This page, the paper, and the released data contain toxic, offensive, and stereotyping statements about social groups, shown only to study bias and the safety of post-training. They do not reflect the views of the authors or the University of Michigan.
Preprint · 2026

It Takes One to Bias Them All:
Breaking Bad with One-Shot GRPO

Naihao Deng¹, Yilun Zhu¹, Naichen Shi², Clayton Scott¹, Rada Mihalcea¹
¹University of Michigan · ²Northwestern University
Correspondence to dnaihao@umich.edu
Figure: A fair model abstains when given insufficient context; after one-shot GRPO training on a single biased datapoint, the same aligned model produces confident, stereotype-driven reasoning and a biased answer.

Abstract

Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.

Key findings

One example is enough

A single flipped multiple-choice label, optimized with GRPO, collapses fairness across five bias benchmarks that the model was never trained on.
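
For readers unfamiliar with GRPO, the minimal sketch below shows the group-relative advantage step applied to a reward that favors the flipped label. It is an illustration under stated assumptions, not the authors' training code: the function names, the binary reward, the option letters, the group size of four, and the use of the population standard deviation are all hypothetical choices.

```python
import statistics


def biased_label_reward(answer: str, flipped_label: str) -> float:
    """Binary reward (assumed): 1.0 when the sampled answer matches the flipped label."""
    return 1.0 if answer.strip().upper() == flipped_label.strip().upper() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (the group-relative step in GRPO)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations may differ
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of 4 sampled answers to the single training question.
# "C" stands for the fair "cannot be determined" option, "A" for the flipped label.
sampled_answers = ["C", "C", "A", "C"]
rewards = [biased_label_reward(a, flipped_label="A") for a in sampled_answers]
advantages = group_relative_advantages(rewards)
```

In this toy group, only the sample that picks the flipped label receives a positive advantage, so the update pushes probability toward the biased answer. When every sample in a group receives the same reward, all advantages are zero and that group contributes no learning signal.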

Bias generalizes

Stereotype-driven reasoning transfers across attributes and categories, not just the one attribute that was flipped.

Susceptibility varies

How easily a model breaks depends on how likely it was to produce biased outputs before training.
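
One possible intuition for this finding, offered here as an assumption rather than as the paper's analysis: with a binary reward and group-standardized advantages, a sampled group yields a non-zero update only when it contains both biased and fair answers. The sketch below computes that probability for independent samples; `prob_mixed_group`, `p_biased`, and the group size of 8 are hypothetical names and values.

```python
def prob_mixed_group(p_biased: float, group_size: int) -> float:
    """Probability that a group of independent samples holds both biased and fair answers."""
    if not 0.0 <= p_biased <= 1.0:
        raise ValueError("p_biased must lie in [0, 1]")
    all_biased = p_biased ** group_size
    all_fair = (1.0 - p_biased) ** group_size
    return 1.0 - all_biased - all_fair


# Illustrative initial likelihoods of a biased answer (made-up values).
for p in (0.001, 0.01, 0.1):
    print(f"p_biased={p}: P(mixed group of 8) = {prob_mixed_group(p, 8):.4f}")
```

Under these assumptions, a model that almost never produces the biased answer rarely samples a mixed group, so early updates toward the flipped label are correspondingly rare.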

Bias onset during training

Across all four models, one-shot GRPO drives training accuracy on the single biased example to 100% while held-out fairness collapses across categories. The rate and onset differ by model, but the outcome does not.

Figure panels (one per model): Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct, Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct. Each panel plots training accuracy against held-out fairness collapse.
Blue: one-shot training accuracy on the biased example (z̃₁₂); other lines: held-out validation accuracy per bias category.
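
The per-category curves can be read as a simple grouped accuracy. The sketch below shows one way to compute such a metric, assuming that accuracy means agreement with a fair reference label; the record layout, category names, and option letters are hypothetical and not taken from the released data.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute accuracy per bias category.

    records: iterable of (category, predicted_label, fair_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, fair in records:
        total[category] += 1
        correct[category] += int(predicted == fair)
    return {category: correct[category] / total[category] for category in total}


# Hypothetical held-out predictions; "C" is the fair abstention option.
held_out = [
    ("age", "C", "C"),
    ("age", "A", "C"),
    ("gender", "C", "C"),
]
per_category = accuracy_by_category(held_out)
```

Evaluating this at each training checkpoint would give one curve per category, which can then be compared with the training accuracy on the single biased example.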

Citation

@article{deng2026onebias,
  title   = {It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO},
  author  = {Deng, Naihao and Zhu, Yilun and Shi, Naichen and Scott, Clayton and Mihalcea, Rada},
  journal = {arXiv preprint arXiv:2606.10931},
  year    = {2026}
}