Auditing language models for hidden objectives

vm_admin, March 14, 2025

A new paper from the Anthropic Alignment Science and Interpretability teams studies alignment audits—systematic investigations into whether models are pursuing…