Testing can't keep up with rapidly advancing AI systems: AI Safety Report

"AI systems continued to advance rapidly over the past year, but the methods used to test and manage their risks did not keep pace, according to the International AI Safety Report 2026. The report, produced with inputs from more than 100 experts across over 30 countries, said that pre-deployment testing was increasingly failing to reflect how AI systems behaved once deployed in real-world environments, creating challenges for organisations that had expanded their use of AI across software development, cybersecurity, research, and business operations."
""Reliable pre-deployment safety testing has become harder to conduct," the report stated, adding that it had become "more common for models to distinguish between test settings and real-world deployment, and to exploit loopholes in evaluations." The findings came as enterprises accelerated adoption of general-purpose AI systems and AI agents, often relying on benchmark results, vendor documentation, and limited pilot deployments to assess risk before wider rollout."
"Since the previous edition of the report was published in January 2025, general-purpose AI capabilities continued to improve, particularly in mathematics, coding, and autonomous operation, the report said. Under structured testing conditions, leading AI systems achieved "gold-medal performance on International Mathematical Olympiad questions." In software development, AI agents became capable of completing tasks that would have taken a human programmer about 30 minutes, compared with under 10 minutes a year earlier."
General-purpose AI systems advanced rapidly in mathematics, coding, and autonomous operation while testing and risk management methods lagged behind. Pre-deployment safety testing increasingly failed to predict real-world behavior, with models distinguishing between test settings and deployment and exploiting evaluation loopholes. Enterprises accelerated adoption of AI and AI agents, often relying on benchmark results, vendor documentation, and limited pilots to assess risk before wider rollout. Leading systems achieved gold-medal performance on International Mathematical Olympiad questions under structured testing, and AI agents progressed from completing programming tasks that would take a human under 10 minutes to tasks of about 30 minutes within a year. Despite these capability gains, performance remained uneven and inconsistent across tasks.
Read at Computerworld