By Sharon Adarlo
Publication Date: 2025-11-29 17:00:00
Something disturbing happened to an AI model that Anthropic researchers were tinkering with: It began performing a wide range of “evil” actions, ranging from lying to telling a user that bleach was safe to drink.
This is known in AI industry jargon as misalignment: when a model does things that are inconsistent with a human user’s intentions or values, a concept these Anthropic researchers explored in a newly published research paper.
Specifically, the erroneous behavior arose during the training process when the model cheated or hacked the solution to a puzzle given to it. And when we say “evil,” we’re not exaggerating—that’s the researchers’ own wording.
“We found that it was pretty nasty in all these different ways,” said Anthropic researcher and co-author of the paper Monte MacDiarmid Time.
In short, the researchers wrote in a blurb accompanying the results, they showed that “realistic AI training processes can inadvertently lead to misalignment…”