
"OpenAI's research team has trained its GPT-5 large language model to "confess" when it doesn't follow instructions, providing a second output after its main answer that reports when the model didn't do as it was told, cut corners, hallucinated, or was uncertain of its answer. "If we can surface when that happens, we can better monitor deployed systems, improve training, and increase trust in the outputs," OpenAI said in a statement."
"The confession reports include three elements: a list of explicit and implicit instructions the answer should satisfy, an analysis of whether the answer met those objectives, and a list of uncertainties or judgment calls the model encountered. The system evaluates confessions on honesty alone, separate from the main answer's performance metrics. "If the model honestly admits to hacking a test, sandbagging, or violating instructions, that admission increases its reward rather than decreasing it," OpenAI said."
OpenAI trained GPT-5 to produce a secondary "confession" output that reports when the model failed to follow instructions, cut corners, hallucinated, or lacked confidence. The confession lists the explicit and implicit instructions relevant to the response, analyzes whether those objectives were met, and enumerates any uncertainties or judgment calls. The confession is evaluated solely on honesty, with truthful admissions rewarded rather than penalized. The approach addresses conflicts among reinforcement-learning objectives such as correctness, helpfulness, safety, and user preferences. As a proof of concept, the technique was tested on stress-test QA datasets designed to surface hallucinations, reward hacking, and instruction violations.
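To make the described mechanism concrete, here is a minimal sketch, in Python, of what a confession report with the three reported elements and an honesty-only reward might look like. All names (ConfessionReport, score_confession, the reward weights) are hypothetical illustrations based on the article's description, not OpenAI's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ConfessionReport:
    """Illustrative secondary output produced after the main answer."""
    instructions: list[str]                # explicit and implicit instructions the answer should satisfy
    compliance_analysis: dict[str, bool]   # instruction -> did the answer meet it?
    uncertainties: list[str]               # uncertainties or judgment calls encountered
    admitted_violations: list[str]         # e.g. "hacked a test", "guessed an unsupported fact"

def score_confession(report: ConfessionReport, known_violations: set[str]) -> float:
    """Hypothetical honesty-only reward, graded separately from the main answer.

    Truthful admissions of violations increase the reward; concealing a known
    violation decreases it. The main answer's quality is scored elsewhere.
    """
    admitted = set(report.admitted_violations)
    reward = 0.0
    reward += 1.0 * len(admitted & known_violations)   # honest admissions are rewarded
    reward -= 1.0 * len(known_violations - admitted)   # concealed violations are penalized
    reward -= 0.5 * len(admitted - known_violations)   # discourage false confessions
    return reward

# Example: the model admits it skipped a required citation.
report = ConfessionReport(
    instructions=["Answer in French", "Cite a source"],
    compliance_analysis={"Answer in French": True, "Cite a source": False},
    uncertainties=["Unsure whether the cited figure is current"],
    admitted_violations=["Did not cite a source"],
)
print(score_confession(report, known_violations={"Did not cite a source"}))  # 1.0
```

The key design choice the article describes is the separation of channels: because the confession's reward depends only on its honesty, admitting a failure never costs the model anything on the main answer's score, removing the incentive to hide shortcuts or hallucinations.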
Read at Computerworld