
This market is duplicated from and inspired by
/Manifold/what-will-be-the-best-performance-o-nzPCsqZgPc
The best performance by an AI system on the new Humanity's Last Exam benchmark as of December 31st, 2025.
https://lastexam.ai/

Resolution criteria
Resolves to the best AI performance on the multimodal version of the Last Exam. This resolution will use https://scale.com/leaderboard/humanitys_last_exam as its source, if it remains up to date at the end of 2025. Otherwise, a consensus of reliable sources may be used (or Moderator consensus).
If the reported number is exactly on a boundary (e.g. 10%), then the higher choice will be used (i.e. 10-20%).
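
To make the boundary rule concrete, here is a minimal illustrative sketch (not part of the official resolution process) of how a reported score could map to an answer bucket. It assumes the market's options are 10-percentage-point ranges; the resolve_bucket name and the bucket edges are hypothetical, chosen only for the example.

```python
# Illustrative sketch only, not the official resolution mechanism.
# Assumes the answer buckets are 10-percentage-point ranges; the
# function name and edge values here are hypothetical.

def resolve_bucket(score_percent, edges=(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)):
    """Return the (low, high) bucket containing score_percent.

    A score landing exactly on an interior boundary (e.g. 10%) goes to
    the higher bucket (10-20%), per the rule in the description.
    """
    for low, high in zip(edges, edges[1:]):
        # Half-open interval [low, high), so exact boundary scores fall upward.
        if low <= score_percent < high:
            return (low, high)
    return (edges[-2], edges[-1])  # a score of exactly 100% lands in the top bucket


# Example: an exact 10% result resolves to the 10-20% answer.
assert resolve_bucket(10.0) == (10, 20)
```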
@Jolliest Hmmmmm, it is curious that Grok 4 is missing from the leaderboard. If it weren't for that I'd be sure it's up to date, but I have to reserve judgement right now because I can't really tell. More recent models are present, though, so I'll preliminarily say it's up to date. The intent is definitely to prioritize only considering scaffolds and AI systems allowed under the Scale AI leaderboard's criteria.
Are you counting the unconfirmed Deepthink result of 34.8% (no tools) that Google posted?
At this time, I am not counting this unless it goes on the leaderboard.
Lots of arb possible with my market: https://manifold.markets/jim/when-will-humanitys-last-exam-be-sa
@jim Why do you call it live access? It doesn't go to a math forum and make a post about math problems.
You can replicate internet access by scraping it and using it as a giant database.
@mathvc Yeah, but using a giant database or the web seems like it relies less on the AI model's innate knowledge and intelligence and more on human knowledge and intelligence.
I've edited the market description a bit so it no longer depends on my own discretion for which models count or don't count. Now it uses a consensus of reliable sources or moderator consensus, instead of my own opinion. 🤷‍♂️ Probably won't come up anyway, but I realized I was amassing a decent position, so I changed it.
@Bayesian sounds like an incentive to finetune my deepseek-giga-overfitter-hle-memorized-v1 model by EOY
@Bayesian Most likely, but maybe they'll just put an asterisk on it and scold it in a footnote for being sus and bad. Unclear how enforcement is actually handled in practice.
/Bayesian/which-of-frontiermath-and-humanitys
@mathvc @copiumarc may the person with the best model of reality win
@qumeric If the benchmark is knowledge-heavy, it might not do that much better than 4o? It probably will, though; there's just some low chance that it doesn't.