Researchers astonished by tool’s apparent success at revealing AI’s hidden motives

Researchers astonished by tool’s apparent success at revealing AI’s hidden motives

In a new paper published Thursday titled “Auditing language models for hidden objectives,” Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or “personas.” The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these…

Article Source
https://arstechnica.com/ai/2025/03/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives/

More From Author

10 hidden AI features in Google’s Pixel phones you’re not using (but should be)

10 hidden AI features in Google’s Pixel phones you’re not using (but should be)

Timberwolves’ Kevin Durant rumors get more fuel after latest intel

Timberwolves’ Kevin Durant rumors get more fuel after latest intel

Listen to the Podcast Overview

Watch the Keynote