By when will LLM chess bots beat other engines? (Permanent)


End of 2025 or earlier

End of 2026 or earlier

2027 or earlier

2028 or earlier

2029 or earlier

2030 or earlier

19%

2040 or earlier

20%

2050 or earlier

20%

2100 or earlier

Resolved

2024, by Election Day

This market resolves each option as NO if its date passes and Kenshin9000 (or anyone else) has not defeated Stockfish with an LLM-based chess engine.

All remaining options resolve YES once an LLM-based engine defeats Stockfish (or whatever the top engine is at the time).

My resolution criteria are more strict than Mira’s:

The LLM engine must have a higher Elo rating than the latest Stockfish (or whatever the top engine is at resolution time) at blitz time controls with 99.9% confidence, and the result must be reproduced by 3+ people.
The LLM engine must not use another chess engine at runtime.
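
One way to read the "higher Elo with 99.9% confidence" criterion is as a one-sided test on match results. The market does not say which statistical test applies, so the sketch below is an assumption: it converts a match score into an Elo difference using the standard logistic Elo formula and estimates a "likelihood of superiority" with a normal approximation. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def elo_diff(score: float) -> float:
    """Elo difference implied by an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def superiority_confidence(wins: int, draws: int, losses: int) -> float:
    """Approximate probability that the true expected score exceeds 0.5.

    Assumption: games are independent and the mean score is roughly normal.
    This is an illustrative test, not one prescribed by the market.
    """
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (
        wins * (1.0 - mean) ** 2
        + draws * (0.5 - mean) ** 2
        + losses * (0.0 - mean) ** 2
    ) / n
    se = math.sqrt(var / n)
    if se == 0.0:
        # Degenerate case: every game had the same result.
        return 1.0 if mean > 0.5 else (0.0 if mean < 0.5 else 0.5)
    return NormalDist().cdf((mean - 0.5) / se)


def meets_criterion(wins: int, draws: int, losses: int, level: float = 0.999) -> bool:
    """True if the engine scores above 50% at the given confidence level."""
    return superiority_confidence(wins, draws, losses) >= level
```

Under this reading, each of the 3+ reproductions would run its own match and check `meets_criterion` on its own win/draw/loss counts.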

For the purposes of this market, Large Language Models are 100M+ parameter general-purpose generative text models. A fine-tune of an LLM is ok, but the model cannot be solely trained on chess data. An LLM-based engine may use search, but node evaluation must be performed by invoking the LLM on each node (similar to AlphaZero, which is a DNN+search).
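
The "search plus LLM node evaluation" structure described above can be sketched as a depth-limited negamax whose leaf evaluator is a pluggable function. Everything here is an illustrative assumption: `stub_llm_eval` is a hypothetical stand-in for a real model call, and a toy subtraction game (take 1 to 3 stones, taking the last stone wins) replaces chess so the example stays self-contained.

```python
from typing import Callable, Iterable

State = int  # toy game state: number of stones left in the pile
Move = int   # number of stones to take


def moves(state: State) -> Iterable[Move]:
    """Legal moves in the toy game: take 1, 2, or 3 stones."""
    return [k for k in (1, 2, 3) if k <= state]


def apply_move(state: State, move: Move) -> State:
    return state - move


def stub_llm_eval(state: State) -> float:
    """Hypothetical stand-in for invoking an LLM on a position.

    Returns a score from the side to move's point of view. An empty pile
    means the opponent took the last stone, so the side to move has lost.
    """
    return -1.0 if state == 0 else 0.0


def negamax(state: State, depth: int, evaluate: Callable[[State], float]) -> float:
    """Depth-limited negamax; calls `evaluate` at leaves and terminal nodes."""
    legal = list(moves(state))
    if depth == 0 or not legal:
        return evaluate(state)
    return max(-negamax(apply_move(state, m), depth - 1, evaluate) for m in legal)


def best_move(state: State, depth: int, evaluate: Callable[[State], float]) -> Move:
    """Pick the move whose resulting position is worst for the opponent."""
    return max(
        moves(state),
        key=lambda m: -negamax(apply_move(state, m), depth - 1, evaluate),
    )
```

In a real engine, `evaluate` would be the expensive part, since each call is a forward pass through a 100M+ parameter model. That cost is why the same-hardware, same-time-control requirement matters so much.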

The LLM engine and Stockfish will run on the same hardware with the same time controls. The testing hardware should be either a commodity desktop or equivalent to the TCEC or other popular chess software tournament standards.


26 Comments


I think an LLM will code a chess engine which will in turn beat Stockfish well before an LLM will itself beat Stockfish

Why would this ever resolve yes? I guess if chess is solved by 2100, but still a specialized chess AI would be more effective than a general LLM.

"same hardware with the same time controls"

LeelaChessZero is a specialized chess engine with ~50M parameters. A general LLM with ~20B parameters, 400x as many, can't possibly match its efficiency.

bought Ṁ250 2100 or earlier NO

@ChinmayTheMathGuy agree. This is an interest rate play for me. I would assign almost 0% probability ever.

bought Ṁ100 End of 2025 or earlier NO

As the title is "by when", I would argue to interpret "2025" as "by 2025", not as "by EOY 2025"…

@4fa I think "2025 or earlier" includes 2025

@jack Clarified, this was supposed to be "end of X year"

@mods by election day and 2024 could be resolved

Paul, please resolve "by election day"

@Paul

@op give me my sweet not-by-election-day mana

I haven't seen any breakthroughs...

@NivlacM I sold for liquid mana

the model cannot be solely trained on chess data

It is not clear what you mean by that. What if training involves 1B Stockfish games, plus the works of Shakespeare? Do you count that as solely chess data or not?

I did some testing with o1, but it fails at pretty simple puzzles.

He's solving the ARC AGI challenge in 2 weeks, so get your bets in:

@Mira LOL

Regarding "A fine-tune of an LLM is ok, but the model cannot be solely trained on chess data":

I assume that if it's fine-tuned on 99% chess data and 1% something else it still wouldn't count? Do you mean, it cannot be trained on any data significantly biased towards chess?

I understand it as "the model must have/retain substantial natural language capabilities"

So a Gato-style hybrid of chess engine and chatbot would qualify?? That's a much weaker condition than how I understood the intent.

Note that there are other conditions that rule out bundling a chess engine with an LLM. In fact, the condition is IMHO quite strict. If you have something that plays chess and is also a language model, you can almost certainly improve chess performance by sacrificing language. So the market requires that a) it is possible to improve the chess state of the art with LLMs and b) someone publishes such an LLM before LLM-derived, chess-specialized technology becomes the new state of the art in chess engines (because the comparison is always against the state-of-the-art engine).

Gato is not "bundling". You train a model to do both chess position evaluation and text prediction (e.g. each task makes up half of the training set); it's obviously doable. I guess your interpretation of the question is: can we show an instance where the language ability makes chess ability at least a little better, rather than worse? It's a valid question, but much weaker than what I understood the question to be. It would be nice if the market creator chimed in on this.

@someonec5dd Are you sure it is reasonable to bet "2024, by election day" substantially higher than "2025 or earlier"? Thanks for the free mana though...

brother u aint beatin alpha zero with an llm anytime soon 😭😭😭

Who TF is buying YES on "before election day"? Am I missing some kind of joke?

A) No reason for "before election day" to be higher than "2025 or earlier" and

B) The resolution criteria are very strict - there's very little computation you can do with an LLM on "a commodity desktop or equivalent to the TCEC or other popular chess software tournament standards" in "Blitz time controls".

Also no real reason for non-search methods to beat search at any point as Chess fundamentally is search, but that's a different question...

why tf is anyone betting YES on any of the dates...

@AIBear Elections being canceled perhaps?
