Auditing language models for hidden objectives
A new paper from the Anthropic Alignment Science and Interpretability teams studies alignment audits—systematic investigations into whether models are pursuing…