
"Andrew Ng has been pounding on a point many builders have learned through hard experience: "When data agents fail, they often fail silently-giving confident-sounding answers that are wrong, and it can be hard to figure out what caused the failure." He emphasizes systematic evaluation and observability for each step an agent takes, not just end-to-end accuracy. We may like the term "vibe coding," but smart developers are forcing the rigor of unit tests, traces, and health checks for agent plans, tools, and memory."
"In other words, they're treating agents like distributed systems. You instrument every step with OpenTelemetry, you keep small "golden" data sets for repeatable evals, and you run regressions on plans and tools the same way you do for APIs. This becomes critical as we move beyond toy apps and start architecting agentic systems, where Ng notes that agents themselves are being used to write and run tests to keep other agents honest."
Models keep getting smarter while applications continue to fail on familiar software problems, leaving the gap between demos and durable products as the central engineering challenge. Development teams prioritize basic engineering practices to close that gap. Agents often fail silently, producing confident but incorrect outputs, so teams add systematic evaluation and observability for each agent step rather than relying solely on end-to-end accuracy. Engineers treat agents like distributed systems by instrumenting steps with OpenTelemetry, maintaining small "golden" datasets for repeatable evaluations, running regressions on plans and tools, and versioning and reviewing test harnesses. Agents are even being used to test other agents.
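The "golden" dataset practice amounts to an ordinary regression suite. Below is a minimal sketch, assuming a hypothetical agent() function and a tiny inline set of golden cases; in practice the cases would live in a small versioned file and run in CI.

```python
# Minimal sketch: replay a small "golden" dataset against the agent and
# fail loudly on any drift, the same way an API regression suite would.
# agent() and the golden cases are stand-ins for illustration.
import json

GOLDEN = [  # in practice, a small versioned file of input/expected pairs
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def agent(prompt: str) -> str:
    """Stand-in for the real agent under test."""
    return {"2 + 2": "4", "capital of France": "Paris"}[prompt]

def run_regression(cases):
    """Return the list of cases where the agent's output drifted."""
    failures = []
    for case in cases:
        got = agent(case["input"])
        if got.strip() != case["expected"]:
            failures.append({"case": case, "got": got})
    return failures

if __name__ == "__main__":
    failures = run_regression(GOLDEN)
    print(json.dumps(failures, indent=2) if failures else "all golden cases pass")
    # Nonzero exit breaks the build, so a silent regression becomes a loud one.
    raise SystemExit(1 if failures else 0)
```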
Read at InfoWorld