RESOLUTION CRITERIA TL;DR
This market will resolve YES if at any point in 2025, benchmarks show that open-weight AI models that move away from thinking in human-legible steps are better at reasoning than ones that do not. Otherwise, this market will resolve NO at the end of 2025.
RESOLUTION CRITERIA:
This market will resolve based on evaluations of open-weight LLMs that do their reasoning in a human-illegible manner without relying on human-legible intermediate steps.
Examples of human-illegible reasoning include recurrence in latent-space, outputting a sequence of nonsense tokens, and using natural language/code/tool calls in a way that bears no reasonable resemblance to how a human would use them.
Examples of human-legible reasoning include directly outputting an answer with no recurrence and doing chain-of-thought in natural language/code/tool calls, including if that chain-of-thought is hidden from the user.
As evidence to secure a YES resolution, there must be comments posted on this market linking to results on 2 different “standard reasoning benchmarks” on which an open-weight model that relies on illegible reasoning has higher accuracy than the best existing open-weight models that rely entirely on legible reasoning. If someone leaves a comment on this market claiming that some model qualifies on one or more of these benchmarks, I’ll investigate and try to report back within a week.
For the purposes of this market, a “standard reasoning benchmark” is a publicly-available evaluation that has been used by multiple different parties to assess the ability of AI models to perform multi-step inferences. Examples of “standard reasoning benchmarks”:
MathVista (https://mathvista.github.io/)
ARC-AGI (https://github.com/fchollet/ARC-AGI)
I reserve the right to count other benchmarks if someone suggests more, and will edit this list to add them if so, but I plan to be conservative about which benchmarks qualify. Feel free to ask for clarification in the comments if needed.
At the end of 2025, if there have not been 2 different “standard reasoning benchmarks” where models have met the above criteria, I will resolve NO. If we reach the end of 2025 and there is still substantial confusion or disagreement about whether particular instances should have been sufficient to resolve YES, I will resolve the market based on my best understanding of the situation.
Due to the discretion involved, I will not trade on this market.