Auditing language models for hidden objectives

A new paper from the Anthropic Alignment Science and Interpretability teams studies alignment audits: systematic investigations into whether models are pursuing hidden objectives. We practiced alignment audits by deliberately training a language model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This exercise built practical experience in conducting alignment audits and served as a testbed for developing auditing techniques for future study.

In…

Article Source
https://www.anthropic.com/research/auditing-hidden-objectives
