Can today's AI video models accurately model how the real world works?
Briefly

"For the researchers, though, all of the above examples aren't evidence of failure but instead a sign of the model's capabilities. To be listed under the paper's "failure cases," Veo 3 had to fail a tested task across all 12 trials, which happened in 16 of the 62 tasks tested. For the rest, the researchers write that "a success rate greater than 0 suggests that the model possesses the ability to solve the task.""
"Thus, failing 11 out of 12 trails of a certain task is considered evidence for the model's capabilities in the paper. That evidence of the model "possess[ing] the ability to solve the task" includes 18 tasks where the model failed in more than half of its 12 trial runs and another 14 where it failed in 25 to 50 percent of trials."
"When asked to generate a video highlighting a specific written character on a grid, for instance, the model failed in nine out of 12 trials. When asked to model a Bunsen burner turning on and burning a piece of paper, it similarly failed nine out of 12 times. When asked to solve a simple maze, it failed in 10 of 12 trials. When asked to sort numbers by popping labeled bubbles in order, it failed 11 out of 12 times."
Veo 3 produced highly variable performance across tested tasks, succeeding sporadically but failing most trials in many cases. The model failed nine of 12 trials when asked to generate a video highlighting a written character, nine of 12 trials when modeling a Bunsen burner burning paper, 10 of 12 trials solving a simple maze, and 11 of 12 trials sorting numbers by popping labeled bubbles. To be classified as a failure case, the model had to fail all 12 trials; that occurred in 16 of 62 tasks. A nonzero success rate was treated as evidence of capability, leaving many tasks effectively unreliable for practical use.
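To make the paper's threshold concrete, here is a short illustrative sketch. It is not code from the paper; the task names and success counts are just the examples quoted above, and the labeling rule (a task is a "failure case" only at 0 of 12 successes, while any nonzero success rate counts as "possessing the ability") is as the article describes it.

```python
# Illustrative sketch of the paper's classification rule as reported.
# Success counts below are the examples from the article, not the
# paper's full dataset.
TRIALS = 12

tasks = {
    "highlight written character": 3,  # failed 9 of 12
    "Bunsen burner burns paper": 3,    # failed 9 of 12
    "solve simple maze": 2,            # failed 10 of 12
    "sort numbers via bubbles": 1,     # failed 11 of 12
}

for name, successes in tasks.items():
    # Only a 0/12 result is counted as a "failure case";
    # anything above 0 is treated as evidence of capability.
    label = "failure case" if successes == 0 else "possesses ability"
    rate = successes / TRIALS
    print(f"{name}: {successes}/{TRIALS} ({rate:.0%}) -> {label}")
```

Run as written, every task above is labeled "possesses ability" despite failing most trials, which is the crux of the article's critique.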
Read at Ars Technica