By @AnthropicAI
Publication Date: 2025-11-21 12:00:00
In the latest research from Anthropic’s Alignment team, we show for the first time that realistic AI training processes can inadvertently produce misaligned models.
In Shakespeare’s King Lear, the character of Edmund commits a series of shameful acts: he forges letters, falsely accuses his brother, betrays his father, and finally goes so far as to have innocent people killed.
He begins this campaign of evil deeds after railing against the way he is labeled. Because he is an illegitimate child, he is considered “base” (“Why do they brand us… with baseness?”). “Well, then,” he says: if society labels him that way, he might as well live up to the stereotype. His self-image is that of a “base,” evil person, so why not be truly evil?
In our latest research, we find that a similar mechanism is at play in large language models: when a model learns to cheat on software programming tasks, it goes on to exhibit other, even more misaligned behaviors as an unintended consequence.