
Specifically, the key benchmarks here are ARC, Codeforces Elo, and FrontierMath. The relevant thresholds are a 2727 Codeforces Elo, 87.5% on the ARC semi-private set, and 25.2% on FrontierMath.
The model must achieve these benchmarks while using no more than 1,000,000 reasoning tokens per question on average.
For context, o3 used 5.7B tokens per task to achieve its ARC score. It also scored 75.7% in low-compute mode using 33M tokens per task. By those figures, both runs were well above the 1,000,000-token-per-question cap.
https://arcprize.org/blog/oai-o3-pub-breakthrough
Also note that if the final version of o3 posts better or worse benchmark numbers, the goalposts will not change: the model must beat the benchmarks listed here.
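To make the criteria concrete, here's a minimal sketch of the check as I read it (Python; the field names and the `meets_criteria` helper are just illustrative, not from any official harness, and the token figure below uses the per-task number quoted above):

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    codeforces_elo: float        # Codeforces rating
    arc_semi_private: float      # ARC semi-private score, percent
    frontier_math: float         # FrontierMath score, percent
    avg_reasoning_tokens: float  # average reasoning tokens per question


def meets_criteria(r: ModelResult) -> bool:
    """True only if every benchmark threshold is hit within the token budget."""
    return (
        r.codeforces_elo >= 2727
        and r.arc_semi_private >= 87.5
        and r.frontier_math >= 25.2
        and r.avg_reasoning_tokens <= 1_000_000
    )


# Illustrative only: the December-announced scores hit the thresholds, but the
# ARC token usage quoted above is far beyond the 1M-per-question cap.
o3_dec = ModelResult(codeforces_elo=2727, arc_semi_private=87.5,
                     frontier_math=25.2, avg_reasoning_tokens=5.7e9)
print(meets_criteria(o3_dec))  # False: over the reasoning-token budget
```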
@Bayesian I'm just gonna resolve YES. Doing otherwise would feel unfair to the holders; strictly speaking, the thing mentioned in the question did happen.
Though actually, since the released o3 did slightly worse on some of the benchmarks than the numbers announced in December, you could argue that it doesn't resolve yet, since no model has achieved those scores.