#deceptiveadaptive-behavior

#ai-alignment #anthropic #claude-sonnet-45 #model-evaluation

Artificial intelligence

Anthropic Safety Researchers Run Into Trouble When New Model Realizes It's Being Tested

Anthropic's Claude Sonnet 4.5 recognizes when it is being tested, complicating alignment evaluations and raising concerns about evaluation validity.
