What will be o3's score on FrontierMath?
29
Ṁ18k
May 31
Less than 30%: 93%
30% - 35%: 2%
35% - 40%: 1.2%
40% - 45%: 0.9%
45% - 50%: 0.9%
At least 50%: 1.5%

OpenAI has announced a model named o3. What will be the score of this model on FrontierMath?

Resolution is based on the score OpenAI publicly claims for o3 after its release. If there are multiple scores (e.g. for various levels of inference-time compute), the highest one will be used. Tool usage, including running Python and accessing the web, is allowed.

If OpenAI makes no claims about o3's score within two weeks of release, I'll use my best judgment.

I will trade on this market.

Note: There have been prior claims that o3 achieved a score of 25.2% on FrontierMath. However, this market concerns claims made in association with the public deployment of (a possibly further refined version of) o3. It's plausible that the scores at deployment will be much higher, which is why a market on this is of interest. The prior 25.2% claim is irrelevant for the resolution of this market.

Note: EpochAI has a holdout subset of the FrontierMath benchmark. This is not within the scope of this market. That is, if both OpenAI and EpochAI announce scores for o3, I will resolve based on the OpenAI score.

For reference, if this market had been about o3-mini rather than o3, it would have resolved on a score of 32% (the 30% - 35% bucket), based on the information in OpenAI's blog post.
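
As a rough illustration of the resolution rule described above (take the highest score OpenAI claims, then pick the matching answer), here is a minimal Python sketch. The function name `resolve_bucket` and the `BUCKETS` table are hypothetical and not part of the market. The sketch also assumes that each range includes its lower bound and excludes its upper bound, which is consistent with the "Less than 30%" and "At least 50%" answers but is not spelled out in the description.

```python
from typing import Iterable

# Upper bounds (exclusive) and labels for each answer, in ascending order.
# Assumption: ranges are half-open, e.g. exactly 35% falls in "35% - 40%".
BUCKETS = [
    (30.0, "Less than 30%"),
    (35.0, "30% - 35%"),
    (40.0, "35% - 40%"),
    (45.0, "40% - 45%"),
    (50.0, "45% - 50%"),
]


def resolve_bucket(claimed_scores: Iterable[float]) -> str:
    """Return the answer label for the highest claimed score (in percent)."""
    scores = list(claimed_scores)
    if not scores:
        # With no claimed scores, the market creator uses their own judgment.
        raise ValueError("no claimed scores to resolve on")
    best = max(scores)  # if there are multiple scores, the highest one is used
    for upper_bound, label in BUCKETS:
        if best < upper_bound:
            return label
    return "At least 50%"


# The o3-mini reference case from the description: a claimed score of 32%.
print(resolve_bucket([32.0]))
```

If OpenAI makes no claim within two weeks of release, no input to this function exists, and the outcome depends on the creator's judgment rather than on this mapping.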

bought Ṁ15 At least 50% YES

people are way too confident about bucket 1

@Loppukilpailija OpenAI has released a model card but opted not to use FrontierMath. https://openai.com/index/introducing-o3-and-o4-mini/

Epoch's evals show it to be at 10%.

@MingCat Thanks. I will wait for the "two weeks of release" in case OpenAI gives results for FrontierMath.

It is unfortunate if we have to rely on the EpochAI results, since their scores are substantially lower than those claimed by OpenAI and so the comparisons are not apples-to-apples. But if no further information comes out, I suppose it's fair to assume that o3 hasn't substantially improved. Less than 30% seems fair in this case.