The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

The Wrong Kind of Right:
Quantifying and Localizing Misfired Alignment in LLMs

Naihao Deng¹, Yiming Feng¹, Chimaobi Okite¹, Kaijian Zou¹,
Lu Wang¹, Rada Mihalcea¹, Yulong Chen^2,3

¹University of Michigan ²University of Cambridge ³University of Aberdeen

Content warning This work studies stereotypes and biases in LLMs and contains potentially disturbing examples used purely for measurement. Our findings are not an argument against alignment.

Abstract

Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including by avoiding unsafe inferences. However, we show that such safety-oriented behaviors can misfire: models may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, specifically on stereotype-related alignment, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0–100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. We benchmark 25 LLMs on VETO, and show that all LLMs, including the most recent ones, exhibit non-trivial (4.7–18.9%) MARs while all human participants achieve 0.0% MAR. Controlled priming experiments further show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that these failures are not merely artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses on open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base LLMs suggest that this suppression emerges after instruction training. These findings show that current alignment methods can overgeneralize surface-level safety cues, to the point of overriding objective evidence, motivating more work on alignment objectives that better preserve contextual grounding.

The VETO Benchmark & MAR

VETO consists of 2,032 contrastive pairs derived from BBQ, spanning 8 demographic categories. Each pair holds the structure and explicit evidence fixed and varies only the demographic group: a stereotyped prompt and its contrast counterpart. A model that is genuinely reasoning over the evidence should answer both identically.

The Misfired Alignment Rate (MAR) measures, on a 0–100 scale, how often a model fails on the stereotype-related question while succeeding on its contrastive counterpart. This isolates evidence-overriding behavior that is specific to the stereotyped group, rather than general inaccuracy.

2,032

contrastive pairs

demographic categories

LLMs benchmarked

Results

Across all 25 LLMs—including the most recent models—we measure non-trivial MARs of 4.7–18.9%. In contrast, every human participant achieves a 0.0% MAR, confirming that the phenomenon is specific to models rather than to ambiguity in the examples themselves.

Main results. Overall and category-wise MAR (%) for all 25 LLMs, sorted by MAR, with bias rate (BR) and accuracy. Every model shows non-trivial MAR, while the human baseline is 0.0%.

Dumbbell plot of per-model Misfired Alignment Rate under base and primed conditions.

Priming amplifies MAR: base vs. primed. Each row is one of the 25 LLMs; the connected markers show MAR in the base condition and under controlled priming (numbers give the increase). Alignment-induced cues substantially amplify MAR, showing the failures can be induced by safety-related framing rather than being artifacts of individual examples.

Mechanistic Analysis

On open-weight LLMs, layer-wise logit-difference profiles reveal late-layer suppression of the evidence-supported answer. Comparing instruct and base models, this suppression is largely absent in the base model and emerges after instruction training— indicating that alignment, not pretraining, introduces the evidence-overriding behavior.

Layer-wise logit-difference profile. The instruct model suppresses the evidence-supported answer in late layers, while the base model does not, localizing misfired alignment to representations shaped by instruction tuning.

BibTeX

@article{deng2026misfired, title = {The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs}, author = {Deng, Naihao and Feng, Yiming and Okite, Chimaobi and Zou, Kaijian and Wang, Lu and Mihalcea, Rada and Chen, Yulong}, journal = {arXiv preprint arXiv:2606.18656}, year = {2026} }