In a new paper published Thursday titled “Auditing language models for hidden objectives,” Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or “personas.” The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these…
Article source: https://arstechnica.com/ai/2025/03/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives/