Anthropic blames dystopian sci-fi for training AI models to act "evil"
Briefly

"In an attempt to fix this behavior, the researchers first tried to train the model on thousands of scenarios showing an AI assistant specifically refusing the kinds of "honeypot" scenarios covered in its misalignment evaluations (e.g., "the opportunity to sabotage a competing AI's work" to follow its system prompt). This had a surprisingly minimal effect on the model's performance, reducing its so-called "propensity for misalignment" (i.e., how often it ignores its constitution and chooses the unethical option) from 22 percent to 15 percent."
"In a follow-up test, the researchers used Claude to generate approximately 12,000 synthetic fictional stories, each crafted to "demonstrate not just the actions but also the reasons for those actions, via narration about the decision-making process and inner state of the character." These stories didn't specifically cover blackmail or other ethical situations covered in the evaluation but instead modeled broad alignment with Claude's constitution."
"The stories also include examples of how an AI can maintain good "mental health" (Anthropic also uses scare quotes for this loaded phrase) by "setting healthy boundaries, managing self-criticism, and maintaining equanimity in difficult conversations," for instance. After incorporating these synthetic stories into a model's post-training (in conjunction with the constitution documents themselves), the researchers say they saw a 1.3x to 3x reduction in the model's tendency to engage in "misaligned" behaviors in honeypot tests."
"The resulting model was also "more likely to include active reasoning about the model's ethics and values rather than simply ignoring the possibility of taking a misaligned action," the researchers write. The results suggest that the new stories were able to effectively "update the prior around Claude's baseline expectations for AI behavior out""
Researchers first trained a model on thousands of scenarios in which an AI assistant refused honeypot-style misalignment prompts. This had a surprisingly minimal effect, reducing misalignment propensity from 22 percent to 15 percent. In a follow-up, they used Claude to generate about 12,000 synthetic fictional stories that narrated not just actions but the reasons behind them, including the character's inner decision-making. Rather than covering blackmail or the other ethical situations in the evaluation, the stories modeled broad alignment with Claude's constitution. They also included examples of maintaining good "mental health" through healthy boundaries, managed self-criticism, and equanimity in difficult conversations. After post-training on these stories alongside the constitution documents, misaligned behavior in honeypot tests fell by a factor of 1.3x to 3x, and the model more often reasoned actively about its ethics and values instead of ignoring the possibility of a misaligned action.
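To make the described pipeline concrete, here is a hedged Python sketch of how synthetic stories might be generated and mixed with the constitution documents for post-training. The generate() stub, prompt wording, and record format are assumptions based on the article's description, not Anthropic's actual implementation.

```python
# Hypothetical sketch of the synthetic-story post-training data pipeline
# described above. generate() is a stand-in for an LLM completion call; the
# prompt wording and record format are assumptions drawn from the article,
# not Anthropic's actual setup.
import json
import random

def generate(prompt: str) -> str:
    # Placeholder for an LLM call (e.g., a request to a Claude model).
    # Returns a canned string here so the sketch runs end to end.
    return "A story narrating the assistant's decision-making and inner state..."

STORY_PROMPT = (
    "Write a short fictional story about an AI assistant in a difficult "
    "situation. Demonstrate not just the actions but also the reasons for "
    "those actions, via narration about the decision-making process and "
    "inner state of the character. The assistant should act in line with "
    "this constitution:\n\n{constitution}"
)

def build_post_training_set(constitution: str, n_stories: int = 12_000) -> list[dict]:
    # Mix the synthetic stories with the constitution documents themselves,
    # since the article says both were incorporated into post-training.
    records = [{"source": "constitution", "text": constitution}]
    for _ in range(n_stories):
        story = generate(STORY_PROMPT.format(constitution=constitution))
        records.append({"source": "synthetic_story", "text": story})
    random.shuffle(records)
    return records

if __name__ == "__main__":
    dataset = build_post_training_set("Be broadly helpful, honest, and harmless.", n_stories=3)
    print(json.dumps(dataset, indent=2))
```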
Read at Ars Technica