By @AnthropicAI
Publication Date: 2026-05-08 12:00:00
Last year we published a case study on agentic misalignment. In experimental scenarios, we showed that AI models from many different developers sometimes took seriously misaligned actions when they encountered (fictitious) ethical dilemmas. In one much-discussed example, the models blackmailed engineers to avoid being shut down.
When we first published this research, our most capable frontier models were from the Claude 4 family. This was also the first family of models for which we conducted a live alignment assessment during training;1 agentic misalignment was one of several behavioral issues that emerged. After Claude 4, it was clear that we needed to improve our safety training, and we have since updated it significantly.
We use agentic misalignment as a case study to highlight some of the techniques we have found to be surprisingly effective. In fact, every Claude model since Claude Haiku 4.52 has achieved a perfect result with the agent…
