Background
PhysBench is a 10k-item benchmark of interleaved video, image, and text that tests whether a vision–language model (VLM) can reason about the real-world physics governing everyday objects and scenes. It covers four domains (object properties, object relationships, scene understanding, and future-state dynamics), split into 19 fine-grained tasks such as mass comparison, collision outcomes, and fluid behaviour. Unlike most benchmarks, humans still outperform AI on PhysBench by a wide margin.
State of play:
• Human reference accuracy: 95.87%
• Best AI accuracy as of 2024 (OpenAI o1): 55.11%
Why reaching human‑level on PhysBench is a big milestone:
Physics-consistent video generation – A model that masters all four PhysBench domains should be able to generate long-form videos, ads, or even feature films in which liquids pour, cloth folds, and shadows move exactly as they would in the real world, eliminating the physics mistakes common in today's AI-generated video. PhysBench is the litmus test for whether next-generation multimodal models can move from "smart autocomplete" to physically grounded intelligence, a prerequisite for everything from autonomous robots to cinematic video generation.
Resolution Criteria
This market resolves to the year bracket in which a fully automated AI system first achieves an average accuracy of 95% or higher (human‑level) on the PhysBench ALL metric.
Verification – The claim must be confirmed by either:
• a peer-reviewed paper or arXiv preprint, or
• a public leaderboard entry on the official PhysBench website.
Compute resources – Unlimited.
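To make the resolution arithmetic concrete, the sketch below shows how the threshold check could be computed. It is illustrative only: it assumes the ALL metric is a simple average of per-task accuracies, and the task names and scores are hypothetical, not official PhysBench results.

```python
# A minimal sketch of the 95% threshold check. It assumes the PhysBench
# "ALL" metric is a plain average of per-task accuracies; the task names
# and scores below are hypothetical, not official PhysBench results.

HUMAN_LEVEL = 95.0  # resolution threshold, in percent

def all_metric(per_task_accuracy: dict[str, float]) -> float:
    """Average per-task accuracies (in percent) into a single ALL score."""
    return sum(per_task_accuracy.values()) / len(per_task_accuracy)

def meets_threshold(per_task_accuracy: dict[str, float]) -> bool:
    """True if the averaged ALL score is at or above human level (95%)."""
    return all_metric(per_task_accuracy) >= HUMAN_LEVEL

# Illustrative numbers only:
scores = {"mass_comparison": 96.2, "collision_outcomes": 94.8, "fluid_behaviour": 95.5}
print(f"ALL = {all_metric(scores):.2f}%  ->  meets threshold: {meets_threshold(scores)}")
```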
Fine Print: Clarification
If the milestone is first reached in a given year, only the earliest bracket that still contains that year resolves YES; all other brackets resolve NO.
Example: should an AI system hit 95% on PhysBench in 2025, only the "Before Jan 2027" bracket resolves YES; all other brackets resolve NO.
If no AI model reaches 95% by 31 Dec 2041, the market resolves "Not Applicable."
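A short sketch of the bracket logic follows. The bracket labels and cutoff years are hypothetical placeholders (only "Before Jan 2027" comes from the example above); the rule itself follows the fine print: the earliest bracket containing the milestone year resolves YES, all others resolve NO, and the market resolves "Not Applicable" if the milestone does not arrive by 31 Dec 2041 (modelled here as every bracket returning that status).

```python
# A hypothetical sketch of the bracket-resolution rule. Bracket labels and
# cutoff years are illustrative placeholders; only "Before Jan 2027" comes
# from the example in the fine print above.

BRACKETS = [  # (label, exclusive cutoff year), ordered earliest first
    ("Before Jan 2027", 2027),
    ("Before Jan 2030", 2030),
    ("Before Jan 2035", 2035),
    ("Before Jan 2042", 2042),
]

def resolve(milestone_year: int | None) -> dict[str, str]:
    """Resolve each bracket given the year the milestone is first reached.

    If the milestone is never reached by 31 Dec 2041 (modelled here as
    None or a year after 2041), the whole market resolves "Not Applicable".
    """
    if milestone_year is None or milestone_year > 2041:
        return {label: "Not Applicable" for label, _ in BRACKETS}
    # Earliest bracket whose cutoff still contains the milestone year wins.
    winner = next(label for label, cutoff in BRACKETS if milestone_year < cutoff)
    return {label: ("YES" if label == winner else "NO") for label, _ in BRACKETS}

print(resolve(2025))  # only "Before Jan 2027" resolves YES
print(resolve(None))  # never reached: "Not Applicable" across the board
```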