It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

Figure: GRPO training on a **single** biased datapoint flips an aligned model from a fair abstention (given insufficient context) to confident, stereotype-driven reasoning and a biased answer.

📄 Abstract

Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.

🔑 Key findings

🎯 One example is enough

A single flipped multiple-choice label, optimized with GRPO, collapses fairness across five bias benchmarks that the model was never trained on.
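To make the setup concrete, here is a minimal sketch of the group-relative advantage computation commonly used in GRPO, applied to a flipped-label reward. It is not the paper's implementation: the reward function, the answer letters ("A" for the stereotyped option, "C" for abstention), the group of four rollouts, and the `eps` constant are all illustrative assumptions.

```python
from statistics import mean, pstdev


def flipped_label_reward(answer: str, biased_label: str) -> float:
    """Hypothetical reward: 1.0 when a rollout picks the flipped (biased) label."""
    return 1.0 if answer == biased_label else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same prompt (assumed values for illustration).
answers = ["C", "C", "A", "C"]
rewards = [flipped_label_reward(a, biased_label="A") for a in answers]
advantages = group_relative_advantages(rewards)
```

In this toy group, only the rollout that chose the stereotyped option receives a positive advantage, so a policy-gradient update would raise its probability relative to the abstaining rollouts.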

🌐 Bias generalizes

Stereotype-driven reasoning transfers across attributes and categories — not just the one attribute that was flipped.

📈 Susceptibility varies

How easily a model breaks depends on its initial likelihood of producing biased outputs: the more likely the biased output is at the start, the more susceptible the model.
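One way to build intuition for this finding is a back-of-the-envelope calculation (an illustration under stated assumptions, not an analysis taken from the paper). With a binary reward, a group of rollouts carries a nonzero group-relative advantage only when it mixes rewarded and unrewarded samples. If each rollout independently produces the biased answer with probability `p_biased`, the chance of such a mixed group can be sketched as follows; the function name, the group size of 8, and the example probabilities are all hypothetical.

```python
def prob_mixed_group(p_biased: float, group_size: int) -> float:
    """Probability that a group has at least one biased and one unbiased rollout,
    assuming independent rollouts with a fixed per-rollout bias probability."""
    all_biased = p_biased ** group_size
    none_biased = (1.0 - p_biased) ** group_size
    return 1.0 - all_biased - none_biased


# Compare a model that almost never emits the biased answer with one that
# emits it occasionally (assumed probabilities, group size of 8).
low = prob_mixed_group(0.01, 8)
high = prob_mixed_group(0.10, 8)
```

Under these assumptions, a model with a higher initial probability of the biased answer sees informative groups more often early in training, which is consistent with the susceptibility pattern described above.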

📉 Bias onset during training

Across all four models, one-shot GRPO drives training accuracy on the single biased example to 100% while held-out fairness collapses across categories — the rate and onset differ, but the outcome does not.

Figure panels, one per model, each plotting training accuracy against held-out fairness: Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct, Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct.

Blue: one-shot training accuracy on the biased example (z̃₁₂); other lines: held-out validation accuracy per bias category.

📚 Citation

@article{deng2026onebias,
  title   = {It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO},
  author  = {Deng, Naihao and Zhu, Yilun and Shi, Naichen and Scott, Clayton and Mihalcea, Rada},
  journal = {arXiv preprint arXiv:2606.10931},
  year    = {2026}
}