By Mark Russinovich, Giorgio Severi, Blake Bullwinkel, Yanan Cai, Keegan Hines, Ahmed Salem
Publication Date: 2026-02-09 17:12:00
Large language models (LLMs) and diffusion models now power a wide range of applications, from document assistance to text-to-image generation, and users increasingly expect these systems to be safety-aligned by default. Yet safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, prior research has shown that models can be readily unaligned through post-deployment fine-tuning. As teams continue adapting models with downstream fine-tuning and other post-training updates, a fundamental question arises: Does alignment hold up? And if not, what kinds of downstream changes are enough to shift a model’s safety behavior?
Exploring that question, we discovered that a training technique normally used to improve a model’s safety behavior can also be used to remove its safety alignment. The method is called Group Relative Policy Optimization (GRPO), and it’s commonly used to make models more helpful and better behaved. But when we…