This paper was recently posted to arXiv: https://arxiv.org/abs/2503.21934
It claims that SOTA LLMs scored surprisingly low on this year's USAMO, averaging less than 5%.
In some discussions of this paper I've seen AI defenders claim that the paper is fake.
This weekend I will look into the paper's methodology, try to recreate its results with the models I have access to (DeepSeek, o3-mini, Claude 3.7 thinking) if things are still unclear, and determine whether I think the paper's results are substantially true.
Possible resolutions are:
100%, the paper's results seem basically correct.
80%, the main thrust is correct but it seems like models performed particularly badly in their tests or they graded unnecessarily harshly.
50%, I end up more confused than I am now and don't form an internal consensus.
20%, the paper's results are substantially, but not wholly, incorrect in my view.
0%, the paper seems fake to me or is completely wrong.
My credentials and current epistemic status:
Former USAMO competitor and current PhD student in math.
Significant AI skeptic compared to most of Manifold, but probably not compared to general population.
The results of this paper were surprising to me; I would have expected much better performance.
Since the resolution criteria are subjective, I will not trade in this market.
Resolved! My results/notes from looking into this, for review and/or critique:
I got responses that seemed to me similar to the ones reported in the paper, so I don't think contamination was a major issue.
The grading criteria seemed basically correct for standard Olympiad scoring.
I got useful first steps on P1 and P4 noticeably more often than I would have expected for a 100% resolution, but the follow-through was noticeably and correspondingly worse on those as well.
My p(AGI) has gone down slightly! My rough guess is that there's a larger procedural component to problem-solving than I realized, which computational approaches handle well, and an intelligence/reasoning-based component, which they handle poorly. LLM progress is often seen as progress on the reasoning-based component, and maybe is to some extent, but it could also just be showing that the procedural component is bigger than we realize. On P1 in particular, it really shouldn't be difficult at all to go from a closed form for the digits to a full solution, and no model from the study ever does it, as far as I can tell. My concrete AGI update is: intelligence is narrower than I realized, and can be approximated better than I realized by semantic pattern-recognition, but is also further out than I realized.
Gemini 2.5 consistently gets P1 and P4, which is good! I do not really know how to interpret this fact, though. Subjectively, there's not enough playing-around in the CoT? Like, the model almost always wants to go simple case -> conjecture -> full proof, which is a fine strategy but not the only one. A much better strategy is almost always taking a complex thing (like n^k in base 2n) and breaking it down into parts that can be fiddled around with, and the models all really hate fiddling anywhere except the beginning of their thought process, at least as far as CoT reflects.
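To make the fiddling concrete: a minimal sketch (assuming the P1 setup of looking at n^k written in base 2n; the function name `digits_in_base` is my own) for generating the small cases one would play around with by hand:

```python
def digits_in_base(value, base):
    """Return the digits of value in the given base, least-significant first."""
    ds = []
    while value:
        value, r = divmod(value, base)
        ds.append(r)
    return ds

# Look at the base-2n digits of n^k for a few odd n and small k --
# the kind of simple-case exploration described above.
for n in (5, 7, 9):
    for k in (2, 3):
        print(f"n={n}, k={k}: {digits_in_base(n**k, 2 * n)}")
```

Staring at a table like this is how you'd spot the pattern in the digits before conjecturing a closed form.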
(Also: the amount of intelligence needed to turn the idea for P1/P4 into a proof is really a very small sliver of a shadow of something, it's not hard. Graphs showing that performance from, say, o3-mini to Gemini 2.5 jumps from 0% of possible points to 25% of possible points are not reflecting that we are a quarter of the way to 100%, do not believe that they are.)
Overall, the paper's conclusion seems more correct than I expected even conditioned on me finding it largely correct, so 100% seems like the appropriate resolution.
Aesthetic and non-resolution notes: P1 is worth playing around with if you are computer-brained. P6 is gorgeous (apparently problems like this are now more common in contest math, but they weren't when I was competing). P5 is also beautiful and worth a look if you are already experienced in contest math. P3 and P4 were not really to my taste, mostly I just hate drawing, P2 I'm not sure about yet.
@zsig This is a bit tricky and I'm not very confident in my ability to do it. The current plan is just to compare my results to the results from the paper and note the major discrepancies, then look through published solutions to see if those discrepancies are plausibly a result of recent training data. In principle, unless one of these models updates its knowledge cutoff, this shouldn't be an issue, but who knows what's really going on behind the scenes?