Will the ARC-AGI Grand Prize be claimed in 2024?
311
Ṁ890k
Resolved NO on Dec 6

https://arcprize.org/competition
>=85% performance on the private set of Chollet's Abstraction and Reasoning Corpus, as judged by Chollet et al.

2025 version: https://manifold.markets/JacobPfau/will-the-arcagi-grand-prize-be-clai-srb6t2awj1


The current leaderboard likely underrepresents the progress being made towards achieving the 85% threshold.

I think any winning approach will likely involve synthesizing many new puzzles for training/eval, and it will be easier and of strategic advantage to test systems on synthesized puzzle sets rather than going through the submission process.

One might only make a couple of submissions in order to model the relationship between scores on one's synthesized dataset vs the official one, and then "go stealth" until one has achieved a system whose score on the synthesized dataset would indicate a score >85% on the official dataset.

I think that scenario gives this about a 10% chance.

The current leaderboard implies close to 0%.
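The "couple of submissions" idea above amounts to a small calibration problem. Here is a minimal sketch, assuming the relationship between synthesized-set and official scores is roughly linear (that linearity is my assumption, not something anyone has shown). The function names and the example numbers are hypothetical placeholders, not real ARC results.

```python
def fit_line(synthetic_scores, official_scores):
    """Ordinary least-squares fit: official ~ slope * synthetic + intercept."""
    n = len(synthetic_scores)
    mean_x = sum(synthetic_scores) / n
    mean_y = sum(official_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in synthetic_scores)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(synthetic_scores, official_scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def synthetic_score_needed(slope, intercept, official_target=0.85):
    """Synthetic-set score at which the fitted line predicts the official target."""
    return (official_target - intercept) / slope


if __name__ == "__main__":
    # Hypothetical placeholder pairs (synthetic score, official score)
    # from a handful of paid-for submissions. NOT real data.
    synthetic = [0.20, 0.40, 0.60]
    official = [0.10, 0.30, 0.50]
    slope, intercept = fit_line(synthetic, official)
    print(synthetic_score_needed(slope, intercept))
```

With only two or three submissions the fit is very noisy, so a team doing this would presumably want a comfortable margin above the computed threshold before going public.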

Now that I've bet a not-insignificant amount of mana on this, can someone explain why anyone would bet in favor of this? From what I understand, no one is even close: we wouldn't see an award even if the number of correct answers doubled, and the machines probably got the easiest questions right, so the remaining way is even harder.

Is this basically a bet on whether someone cheats? I filled a fuckton of limit orders at 20 percent odds and it feels like the odds would be optimistic at, like, 8?

I'm already invested, someone tell me what I'm missing.
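For what it's worth, the arithmetic behind "filled at 20 percent, true odds more like 8" can be sketched as below. This is a simplified model, assuming a NO share bought at a fixed price pays 1 mana if the market resolves NO and 0 otherwise, and ignoring fees, loans, and price impact. The function names are mine, not Manifold's.

```python
def no_share_expected_profit(market_yes_prob, believed_yes_prob):
    """Expected profit per NO share, in mana, under the simplified model."""
    cost = 1.0 - market_yes_prob              # price paid for one NO share
    expected_payout = 1.0 - believed_yes_prob  # pays 1 only if it resolves NO
    return expected_payout - cost


def no_share_expected_return(market_yes_prob, believed_yes_prob):
    """Expected profit as a fraction of the amount staked."""
    cost = 1.0 - market_yes_prob
    return no_share_expected_profit(market_yes_prob, believed_yes_prob) / cost


if __name__ == "__main__":
    # Filled at 20% YES while believing the true chance is 8%.
    print(no_share_expected_profit(0.20, 0.08))
    print(no_share_expected_return(0.20, 0.08))
```

Under those assumptions that works out to roughly 0.12 mana of expected profit per share, or about a 15% expected return on the stake, if the 8% estimate is right.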

The MindsAI folks have been making some pretty impressive progress. Hit 60% on the public set recently.

I agree the market is a bit high now, but I can see someone getting excited just from the progress over the past month.

I suspect the public set got into the training data. But yeah, I feel like all the ai markets are a bit distorted atm

"I suspect the public set got into the training data"

Unlikely. The MindsAI folks are testing against custom-made sets and see similar numbers.

Oh shit they might actually have something here

That said: that last 25 percent is likely going to be the hardest, and we got 6 months. I'm pretty comfortable in my no position

So, writing out my reasoning for buying up to 15%:
- I think how well people performed in the past is not very reliable evidence as to the future, since the 1 million dollar grand prize and recent publicity make it way more likely to receive serious effort.
- Very few benchmarks exist where models underperform median humans, and I don't find the arguments as to why this one would be very different compelling.
- Claude Sonnet 3.5 significantly outperforms GPT-4o without scaffolding, so I suspect that either scaling or post-training modifications are in fact pretty helpful for this benchmark.
- Llama 3 400B seems likely to be released this year and potentially helpful.

Good reasoning I think!

My (light) hypothesis is that coding challenges are a good microcosm of basic logic tests of LLMs: they ace old ones and fail miserably on new ones, which makes me think there's direct or indirect memorization at play; basically, they're overfit. The test results so far make me think that the test is not solvable within the training data.

The million dollar thing being new is something I wasn't aware of, and it affects my willingness to dig myself deeper into a NO position. Thanks, this was a good comment!

https://livecodebench.github.io/ and https://livebench.ai/ suggest that the memorization problem exists but isn't extreme for most of the largest foundation models, many of which perform almost as well on benchmarks released before and after they were trained afaict.

I can tune back in later with something more to back up what I'm saying. Who knows, I might be behind, or it was just a fabrication I believed, but if my memory serves there are coding challenges ranked easy to hard where old questions are answered with 100 percent accuracy and on new ones it drops to 0. This was GPT-4, if I'm not mistaken. My model of the universe at this moment is one where LLMs don't just memorize some things; they're nothing but memorization with some flexibility around semantics.

That said, I might be wrong

I don't think this will happen, but if we look into the leaked OpenAI Q* / Strawberry papers, they are working on an intelligence that can actually think and reason, so I think if they succeed they will be able to get this prize without cheating. But I don't think they will succeed in 2024, as it's a very fundamental, underlying problem of general intelligence: true problem-solving needs not just prediction based on previous examples but also the ability to figure out novel stuff on the fly. This would also lead to exponential scientific discoveries imo, but let's see if they do it... Ilya Sutskever also started his company to do exactly this, superintelligence. Oh well.


I made a version of this market which allows for closed source LLMs: https://manifold.markets/RyanGreenblatt/by-when-will-85-be-reached-on-the-p

Here’s someone claiming 100% accuracy on the eval set with a from-scratch transformer: https://x.com/spatialweeb/status/1803950481422848312?s=46&t=fdgdiEzkLwQ2qvItoWggvg

(Doubt this holds up under scrutiny, likely a bug somewhere.)

From the replies, it looks like they were accidentally including the answer along with the examples.

I think you can a priori assign very low probability to this kind of stuff. If GPT-4 and other models that took hundreds of millions of dollars of compute and a ton of very good engineers only got to the mid 30s on ARC, it's very unlikely that one person will just think of one trick that solves deep reasoning and gets to 100%.

bought Ṁ250 NO

Betting no based on the difficulty of YES resolution, in particular requiring models to work offline.

Can't use Gemini in the challenge. It has to run offline and with a maximum runtime of 12 hours on Kaggle.

They make it sound a lot more interesting than it really is. They used 1000x the compute of the prior SOTA to achieve the same results. The real ARC challenge is limited in compute to 12 hours of runtime on Kaggle and has no internet access (so no access to large LLMs).

Note that this prize doesn't allow closed-source models to be used in doing the actual task.

Of course, distillation is possible etc.

bought Ṁ100 YES

Ah, I should have read this comment before I made a bet on yes 😅

"No ARC human baseline exists! http://arcprize.org/arc: "most humans can solve on average 85% of ARC-AGI tasks." But this study used the train set http://arcprize.org/guide: "The public training set is significantly easier than the...public evaluation and private evaluation set""

I tried solving about 20 public test set problems and they were all pretty easy as well. I don't know what the average human would get, but I doubt it would be much lower than 85%.