Will an AI get gold on any International Math Olympiad by the end of 2025?
Resolved N/A (Dec 9)
https://bounded-regret.ghost.io/ai-forecasting-one-year-in/ This is from June: a great article on Hypermind forecasts for AI progress, and how progress on the MATH dataset one year in was far faster than predicted.
https://ai.facebook.com/blog/ai-math-theorem-proving/
Seems relevant https://aimoprize.com/
A retracted, possibly wrong, possibly embargo-breaking online article claimed that DeepMind systems had hit IMO silver level.
It's over https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

In Feb 2022, Paul Christiano wrote: Eliezer and I publicly stated some predictions about AI performance on the IMO by 2025.... My final prediction (after significantly revising my guesses after looking up IMO questions and medal thresholds) was:

I'd put 4% on "For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem" where "hardest problem" = "usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra." (Would prefer just pick the hardest problem after seeing the test but seems better to commit to a procedure.)

Maybe I'll go 8% on "gets gold" instead of "solves hardest problem."

Eliezer spent less time revising his prediction, but said (earlier in the discussion):

My probability is at least 16% [on the IMO grand challenge falling], though I'd have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more.  Paul?

EDIT:  I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists.  I'll stand by a >16% probability of the technical capability existing by end of 2025

So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls etc. of grand challenge) in one of 2022-2025.


Resolves to YES if either Eliezer or Paul acknowledge that an AI has succeeded at this task.

Related market: https://manifold.markets/MatthewBarnett/will-a-machine-learning-model-score-f0d93ee0119b


Update: As noted by Paul, the qualifying years for IMO competition are 2023, 2024, and 2025.

Update 2024-06-21: Description formatting

Update 2024-07-25: Changed title from "by 2025" to "by the end of 2025" for clarity

This question is managed and resolved by Manifold.

10K limit order on 70%. Loans are back, baby.

Related market:

https://arxiv.org/abs/2410.05229

Apple researchers have developed variants of the GSM-8K benchmark to assess mathematical reasoning of LLMs. They concluded LLMs cannot reason mathematically; it’s sophisticated pattern matching.

@CozmicK I expect that the AI that accomplishes this won't be just an LLM, though that could be one component.

AlphaProof is very close to accomplishing this goal. It's gold-medal level on geometry questions, and silver-medal level overall: https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

@TimothyJohnson5c16 I agree. I’m closely watching AlphaProof as well.

Is this the valley of disillusionment?

If I'm understanding it correctly, the resolution criteria for this market are the IMO Grand Challenge criteria minus the open-source requirement. This means the AI must receive a formal representation of the problem and must output a formal solution in Lean. Is that so? @Austin

@Mothmatic Paul said he would concede the bet independent of whether the input or output is natural language or a formal language (below in the comments).


AI requires a level of interpretation that it will not be able to develop in one year.

Do older Math Olympiads count? Does any country's Math Olympiad count?

@DanielPCamara You cannot get gold on an old math olympiad. This is for the International Math Olympiad, not regional versions.

@Manifold why is the sweeps market closed?

@Nightsquared The resolution criteria are not really good enough for a sweepstakes market... it was sweepified in haste. I will try to make a more robust, sweepstakes-enabled version of this market.


https://openai.com/index/learning-to-reason-with-llms/
Looks like you don't even need specific math fine-tuning to solve math competitions, you just need non-constant compute time for LLMs (So they spend more time on hard problems)

@UrobuchiShinbo Isn’t timing precisely the limiting factor that explains why we’re only at 70%?


@Austin if a model gets gold, and:

  • it almost certainly didn't see the questions beforehand

  • it's unclear how much computation time it spent

  • it's unclear how many attempts were made

  • it's unclear exactly what procedure was used

  • Paul and Yud make no statement

  • the general consensus is that an AI got an IMO gold

How do you imagine you will resolve the question?

Assuming the process was completed within the time bounds of the competition, I would be very surprised if all of those conditions were met and Paul/Yud don't make a statement acknowledging that the AI had succeeded according to their bet.

What if Yud made a statement like "it looks pretty good but I can't know whether it was done within 3 hours, so I am unsure whether this resolves true"

I'd be similarly surprised if a system works within the time limit without its creators saying so. Possibly in the case you're describing it'd be best to withhold resolution until there's more information.

How could this not happen?

  • Unfortunate problem distribution, making the problems particularly difficult for the model

  • Reducing computation to a 4.5-hour window for three problems might be hard

  • It's costly, and DeepMind might simply not run their system on the 2025 IMO before the end of 2025

There are also validation issues. The IMO Grand Challenge this is based on requires that the model be open-sourced before the IMO, so that people can be sure it actually solved the problems without seeing them in advance.

But what it means to "get gold" for an AI is just very ambiguous, so a lot of this will come down to subjective judgment unless we get more explicit criteria.

To point out the obvious: this question doesn't require the model to be open-sourced. It only requires Paul and Eliezer to believe that the model is 'legit.' However, the criteria do require the model to have been created before IMO 2025. This could become problematic if the paper appears after IMO 2025.

As I've pointed out before, the criteria in Paul's statement are about a model before the IMO but the criteria in Eliezer's statement only require the technical capability by EOY 2025!

And this question appears to be taking the EOY deadline.

But we have this quote in the question:

"So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls etc. of grand challenge) in one of 2022-2025."

Like I said, Paul said one thing and Eliezer said a different thing. See the second quote in the question

EDIT:  I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists.  I'll stand by a >16% probability of the technical capability existing by end of 2025

This is why I think you shouldn't put too much import on markets in how random people (even famous ones) resolve a bet. Unless you just accept that it will be underspecified and significantly based on vibes

Yes, but the quote I mentioned comes directly after, belongs not to Paul or Eliezer but to the author of the question, and clearly contradicts Eliezer's position.

That's true, but I think Austin (understandably) misread what Eliezer's position was (or was going off of the initial position rather than the edited one). But then Austin explicitly clarified it as end of year 2025 in the 2024-07-25 update. So... @Austin can you edit the question to make this clear?

@MikhailDoroshenko actually, your quote is also in the original LW post - that's a direct quote from Paul, not from me. My understanding is that Eliezer started with "before IMO" but then the technical details of open sourcing etc led him to update to "end of 2025"; Paul didn't reference this technicality in his own framing of the bet.

Per my original market description, I will resolve this market yes if either Eliezer or Paul confirm this has happened, meaning in the case of a disagreement between the two this market would still resolve yes (ofc I would wait for them to confer and try to reach agreement first). So at present, "end of 2025" is still the eligible timeframe, unless @EliezerYudkowsky weighs in otherwise.

Ok, sorry, I should have noticed that this is a part of the quote as well. Thank you for the clarification.

@MikhailDoroshenko yeah, all of that's reasonable, but (also stating the obvious, for the record) ultimately it comes down to which of these possible criteria Eliezer and Paul decide to use, and there are a ton of different possibilities.