RESOLUTION CRITERIA TL;DR
This market will resolve YES if at any point in 2025, benchmarks show that open-weight AI models that move away from thinking in human-legible steps are better at reasoning than ones that do not. Otherwise, this market will resolve NO at the end of 2025.
RESOLUTION CRITERIA:
This market will resolve based on evaluations of open-weight LLMs that do their reasoning in a human-illegible manner without relying on human-legible intermediate steps.
Examples of human-illegible reasoning include recurrence in latent-space, outputting a sequence of nonsense tokens, and using natural language/code/tool calls in a way that bears no reasonable resemblance to how a human would use them.
Examples of human-legible reasoning include directly outputting an answer with no recurrence and doing chain-of-thought in natural language/code/tool calls, including if that chain-of-thought is hidden from the user.
As evidence to secure a YES resolution, there must be comments posted on this market linking to results on 2 different “standard reasoning benchmarks” on which an open-weight model that relies on illegible reasoning has higher accuracy than the best existing open-weight models that rely entirely on legible reasoning. If someone leaves a comment on this market claiming that some model qualifies on one or more of these benchmarks, I’ll investigate and try to report back within a week.
For the purposes of this market, a “standard reasoning benchmark” is a publicly-available evaluation that has been used by multiple different parties to assess the ability of AI models to perform multi-step inferences. Examples of “standard reasoning benchmarks”:
MathVista (https://mathvista.github.io/)
ARC-AGI (https://github.com/fchollet/ARC-AGI)
I reserve the right to count other benchmarks if someone suggests more, and will edit this list to add them if so, but I plan to be conservative about which benchmarks qualify. Feel free to ask for clarification in the comments if needed.
At the end of 2025, if there have not been 2 different “standard reasoning benchmarks” where models have met the above criteria, I will resolve NO. If we reach the end of 2025 and there is still substantial confusion or disagreement about whether particular instances should have been sufficient to resolve YES, I will resolve the market based on my best understanding of the situation.
Due to the discretion involved, I will not trade on this market.