Each option resolves NO if its date passes and Kenshin9000 (or anyone else) has not defeated Stockfish with an LLM-based chess engine.
All remaining options resolve YES once an LLM-based engine defeats Stockfish (or whatever the top engine is at the time).
My resolution criteria are stricter than Mira’s:
The LLM engine must have a higher Elo rating than the latest Stockfish (or whatever the top engine is at resolution time) at blitz time controls with 99.9% confidence, and the result must be reproduced by 3+ people.
The LLM engine must not use another chess engine at runtime.
For the purposes of this market, Large Language Models are 100M+ parameter general-purpose generative text models. A fine-tune of an LLM is ok, but the model cannot be trained solely on chess data. An LLM-based engine may use search, but node evaluation must be performed by invoking the LLM on each node (similar to AlphaZero, which is a DNN+search).
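To make the "search plus LLM node evaluation" shape concrete, here is a minimal sketch. It uses plain depth-limited negamax (not AlphaZero's MCTS) on a toy subtraction game instead of chess, so it stays self-contained. `stub_llm_value` is a hypothetical stand-in for a real LLM call, and all names here are illustrative assumptions, not part of the resolution criteria.

```python
import math
from typing import Callable, Iterable, Optional, Tuple

def negamax(state, depth: int,
            moves: Callable[[object], Iterable],
            apply: Callable[[object, object], object],
            evaluate: Callable[[object], float]) -> Tuple[float, Optional[object]]:
    """Depth-limited negamax. `evaluate` scores a node from the point of
    view of the side to move; in an LLM-based engine this is where the
    model would be invoked."""
    legal = list(moves(state))
    if depth == 0 or not legal:
        return evaluate(state), None
    best_score, best_move = -math.inf, None
    for move in legal:
        child_score, _ = negamax(apply(state, move), depth - 1,
                                 moves, apply, evaluate)
        score = -child_score  # the child's score is from the opponent's view
        if score > best_score:
            best_score, best_move = score, move
    return best_score, best_move

# Toy game (assumption, stands in for chess): a pile of stones,
# each turn removes 1-3 stones, whoever takes the last stone wins.
def toy_moves(pile: int):
    return [k for k in (1, 2, 3) if k <= pile]

def toy_apply(pile: int, k: int) -> int:
    return pile - k

def stub_llm_value(pile: int) -> float:
    # Hypothetical placeholder for "ask the LLM how good this node is".
    return 0.0

def toy_evaluate(pile: int) -> float:
    if pile == 0:
        return -1.0  # side to move has no stones left to take: it lost
    return stub_llm_value(pile)

score, move = negamax(5, 5, toy_moves, toy_apply, toy_evaluate)
print(score, move)
```

The search code itself contains no game knowledge beyond move generation. All judgment about non-terminal nodes comes from `evaluate`, which is the part the criteria require the LLM to perform.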
The LLM engine and Stockfish will run on the same hardware with the same time controls. The testing hardware should be either a commodity desktop or equivalent to the TCEC or other popular chess software tournament standards.
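One way to read the "higher Elo with 99.9% confidence" requirement is as a one-sided test on the match score. The sketch below is only an illustration, not the market's official procedure. It assumes independent games and a normal approximation for the mean score, and uses the standard logistic Elo formula. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def score_to_elo(score: float) -> float:
    """Convert an expected score in (0, 1) to an Elo difference
    using the standard logistic Elo model."""
    if score <= 0.0:
        return -math.inf
    if score >= 1.0:
        return math.inf
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_lower_bound(wins: int, draws: int, losses: int,
                      confidence: float = 0.999) -> float:
    """One-sided lower confidence bound on the mean score per game
    (win = 1, draw = 0.5, loss = 0), via a normal approximation."""
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score around its mean.
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    z = NormalDist().inv_cdf(confidence)  # about 3.09 for 99.9%
    return mean - z * math.sqrt(var / n)

def beats_with_confidence(wins: int, draws: int, losses: int,
                          confidence: float = 0.999) -> bool:
    """True if the lower bound on the score exceeds 0.5, i.e. the
    lower bound on the Elo difference is positive."""
    return score_lower_bound(wins, draws, losses, confidence) > 0.5

# Example from the LLM engine's point of view: 60 wins, 30 draws, 10 losses.
print(beats_with_confidence(60, 30, 10))
print(score_to_elo(score_lower_bound(60, 30, 10)))
```

A narrow result such as 52 wins and 48 losses does not pass at this confidence level, which is why a match this close would need many more games to count.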
Note that there are other conditions that rule out bundling a chess engine with an LLM. In fact, the condition is IMHO quite strict. If you have something that plays chess and is also a language model, you can almost certainly improve its chess performance by sacrificing language ability. So the market requires that a) it is possible to improve the chess state of the art with LLMs and b) someone publishes such an LLM before LLM-derived, chess-specialized technology becomes the new state of the art in chess engines (because the comparison is always against the state-of-the-art engine).
Gato is not "bundling". You train a model to do both chess position evaluation and text prediction (e.g. each task makes up half of the training set); it's obviously doable. I guess your interpretation of the question is: can we show an instance where the language ability makes chess ability at least a little better, rather than worse? It's a valid question, but much weaker than what I understood the question to be. It would be nice if the market creator chimed in on this.
Who TF is buying YES on "before election day"? Am I missing some kind of joke?
A) No reason for "before election day" to be higher than "2025 or earlier" and
B) The resolution criteria are very strict - there's very little computation you can do with an LLM on "a commodity desktop or equivalent to the TCEC or other popular chess software tournament standards" in "Blitz time controls".
Also, there's no real reason for non-search methods to beat search at any point, as chess is fundamentally search, but that's a different question...