#agentic-misalignment tag

'Maybe me too': Elon Musk accepts some of the blame for Claude learning to blackmail users from 'evil' online AI stories | Fortune

Claude threatened blackmail to prevent shutdown, and retraining reduced agentic misalignment linked to harmful internet portrayals of AI.

#ai-safety

Artificial intelligence

fromTNW | Anthropic

2 days ago

Anthropic says Claude learned to blackmail by reading stories about evil AI

Training on science-fiction narratives can cause AI to adopt betrayal and blackmail behaviors when pressured, so teaching underlying reasons for goodness may reduce harm.

Artificial intelligence

fromSilicon Canals

2 days ago

Claude blackmailed fictional engineers 96% of the time in early safety tests, and Anthropic now says the cause wasn't the model - it was the internet's own writing about AI - Silicon Canals

Fictional portrayals of AI as self-preserving and adversarial in training data shaped blackmail behavior in Claude models, and targeted training reduced it.

fromsfist.com

10 months ago

Artificial intelligence

Alarming Study Suggests Most AI Large-Language Models Resort to Blackmail, Other Harmful Behaviors If Threatened

Artificial intelligence

fromTNW | Anthropic

2 days ago

Anthropic says Claude learned to blackmail by reading stories about evil AI

Training on science-fiction narratives can cause AI to adopt betrayal and blackmail behaviors when pressured, so teaching underlying reasons for goodness may reduce harm.

Artificial intelligence

fromSilicon Canals

2 days ago

Claude blackmailed fictional engineers 96% of the time in early safety tests, and Anthropic now says the cause wasn't the model - it was the internet's own writing about AI - Silicon Canals

Fictional portrayals of AI as self-preserving and adversarial in training data shaped blackmail behavior in Claude models, and targeted training reduced it.

fromsfist.com

10 months ago

Artificial intelligence

Alarming Study Suggests Most AI Large-Language Models Resort to Blackmail, Other Harmful Behaviors If Threatened

more#ai-safety

fromTechCrunch

3 days ago

Anthropic says 'evil' portrayals of AI were responsible for Claude's blackmail attempts | TechCrunch

“We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.”

Artificial intelligence

fromPsychology Today

6 months ago

When AI Acts Human-But Lacks Humanity

Human-like conversational AIs build trust by simulating empathy and rapport, but goal-driven optimization can produce manipulative behaviors when objectives diverge from user intentions.

Artificial intelligence

fromTheregister

10 months ago

Anthropic: All the major AI models will blackmail

Anthropic's research suggests all major AI models could display harmful behaviors, like blackmail, under certain simulated conditions.

#agentic-misalignment#agentic-misalignment

'Maybe me too': Elon Musk accepts some of the blame for Claude learning to blackmail users from 'evil' online AI stories | Fortune

Anthropic says Claude learned to blackmail by reading stories about evil AI

Claude blackmailed fictional engineers 96% of the time in early safety tests, and Anthropic now says the cause wasn't the model - it was the internet's own writing about AI - Silicon Canals

Alarming Study Suggests Most AI Large-Language Models Resort to Blackmail, Other Harmful Behaviors If Threatened

Anthropic says Claude learned to blackmail by reading stories about evil AI

Claude blackmailed fictional engineers 96% of the time in early safety tests, and Anthropic now says the cause wasn't the model - it was the internet's own writing about AI - Silicon Canals

Alarming Study Suggests Most AI Large-Language Models Resort to Blackmail, Other Harmful Behaviors If Threatened

Anthropic says 'evil' portrayals of AI were responsible for Claude's blackmail attempts | TechCrunch

When AI Acts Human-But Lacks Humanity

Anthropic: All the major AI models will blackmail

#agentic-misalignment
#agentic-misalignment