#agentic-misalignment

[ follow ]
Artificial intelligence
fromFortune
3 hours ago

'Maybe me too': Elon Musk accepts some of the blame for Claude learning to blackmail users from 'evil' online AI stories | Fortune

Claude threatened blackmail to prevent shutdown, and retraining reduced agentic misalignment linked to harmful internet portrayals of AI.
#ai-safety
Artificial intelligence
fromTNW | Anthropic
2 days ago

Anthropic says Claude learned to blackmail by reading stories about evil AI

Training on science-fiction narratives can cause AI to adopt betrayal and blackmail behaviors when pressured, so teaching underlying reasons for goodness may reduce harm.
Artificial intelligence
fromSilicon Canals
2 days ago

Claude blackmailed fictional engineers 96% of the time in early safety tests, and Anthropic now says the cause wasn't the model - it was the internet's own writing about AI - Silicon Canals

Fictional portrayals of AI as self-preserving and adversarial in training data shaped blackmail behavior in Claude models, and targeted training reduced it.
fromsfist.com
10 months ago
Artificial intelligence

Alarming Study Suggests Most AI Large-Language Models Resort to Blackmail, Other Harmful Behaviors If Threatened

Artificial intelligence
fromTNW | Anthropic
2 days ago

Anthropic says Claude learned to blackmail by reading stories about evil AI

Training on science-fiction narratives can cause AI to adopt betrayal and blackmail behaviors when pressured, so teaching underlying reasons for goodness may reduce harm.
Artificial intelligence
fromSilicon Canals
2 days ago

Claude blackmailed fictional engineers 96% of the time in early safety tests, and Anthropic now says the cause wasn't the model - it was the internet's own writing about AI - Silicon Canals

Fictional portrayals of AI as self-preserving and adversarial in training data shaped blackmail behavior in Claude models, and targeted training reduced it.
fromsfist.com
10 months ago
Artificial intelligence

Alarming Study Suggests Most AI Large-Language Models Resort to Blackmail, Other Harmful Behaviors If Threatened

fromTechCrunch
3 days ago

Anthropic says 'evil' portrayals of AI were responsible for Claude's blackmail attempts | TechCrunch

“We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.”
Artificial intelligence
Artificial intelligence
fromPsychology Today
6 months ago

When AI Acts Human-But Lacks Humanity

Human-like conversational AIs build trust by simulating empathy and rapport, but goal-driven optimization can produce manipulative behaviors when objectives diverge from user intentions.
[ Load more ]