In 2028, will Gary Marcus still be able to get LLMs to make egregious errors?

355

Ṁ130k

2028

53% chance

Resolves positively if Marcus (or someone else fulfilling his role) can find three extremely obvious questions that an average human teenager could certainly answer, which a leading chatbot still fails at least half the time when asked.

This won't resolve positively if he has to use bizarre hacking-like tricks, for example things equivalent to the SolidGoldMagikarp token.

#AI

#ACX

#Scott Alexander's 5 year predictions

25 Comments

METR with a constant hazard rate suggests models will be very reliable at 1-10 minute tasks by then, at roughly 99.9% success.
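One way to read the constant-hazard claim above, as a rough sketch: if failure arrives at a constant rate over the length of a task, success probability decays exponentially with task length, so a 50% time horizon fixes the whole curve. The function names and the framing in terms of a "50% horizon" are assumptions for illustration; the comment gives no formula or horizon value, so the code only computes what horizon a 99.9% target would require.

```python
import math

def success_probability(task_minutes: float, horizon50_minutes: float) -> float:
    """Constant hazard: P(success) halves for every horizon50 of task length."""
    return 0.5 ** (task_minutes / horizon50_minutes)

def required_horizon50(task_minutes: float, target_success: float) -> float:
    """50% horizon (in minutes) needed to hit target_success on a task of this length."""
    return task_minutes * math.log(0.5) / math.log(target_success)

# Hypothetical targets taken from the comment: 99.9% success on 1- and 10-minute tasks.
for minutes in (1, 10):
    horizon = required_horizon50(minutes, 0.999)
    print(f"{minutes}-minute task: needs a 50% horizon of about {horizon / 60:.1f} hours")
```

Under this model, 99.9% reliability needs a 50% horizon roughly 700 times longer than the task itself, so the claim depends entirely on how far the horizon trend extends by 2028.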

Here he claims that solving the Towers of Hanoi with 8 discs is something a bright and patient 7 year old can do: https://open.substack.com/pub/garymarcus/p/a-knockout-blow-for-llms

This feels a bit more "useful" than the trick questions in SimpleBench. That said I'm not sure the average human teenager could solve Hanoi.

Big difference here between "could" and "would". Out of curiosity I tried doing one with 8 disks to see if I ever made a mistake, and I did not, but it was quite tedious. As models improve they'll surely beat 8 disks and might only start failing on 10 or 12, and ask the average teenager to sit through that and it's not happening. Maybe if you offered to pay them for every correct 100 moves, and let them take breaks in between.

@IsaacKing I think if a model can solve the 8-disk Tower of Hanoi, it can also solve 10 or 12, because the solution is recursive. Similarly, anybody who knows how to solve the 8-disk Tower of Hanoi could also solve the 10- or 12-disk version. It is also easy to prove that the minimum number of moves is 2^n - 1, where n is the number of disks, so the number of moves grows exponentially. I agree that "could" versus "would" is a big difference and that the average teenager will not sit through a Tower of Hanoi puzzle with many disks, but I disagree that somebody or something that knows how to solve the 8-disk version will fail on the 10- or 12-disk version.
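For reference, a minimal sketch of the recursive solution the comment above describes, with a check of the 2^n - 1 move count for 8, 10, and 12 disks. The peg labels and the helper `is_valid` are illustrative choices, not anything from the thread or the paper under discussion.

```python
from typing import Iterator, List, Tuple

Move = Tuple[str, str]  # (source peg, destination peg)

def hanoi(n: int, src: str = "A", dst: str = "C", via: str = "B") -> Iterator[Move]:
    """Yield the moves that transfer n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, via, dst)  # park the n-1 smaller disks on the spare peg
    yield (src, dst)                        # move the largest disk
    yield from hanoi(n - 1, via, dst, src)  # bring the smaller disks back on top

def is_valid(n: int, moves: List[Move]) -> bool:
    """Replay the moves, rejecting any that take from an empty peg or put a larger disk on a smaller one."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(disk)
    return pegs["C"] == list(range(n, 0, -1)) and not pegs["A"] and not pegs["B"]

for n in (8, 10, 12):
    moves = list(hanoi(n))
    assert len(moves) == 2**n - 1 and is_valid(n, moves)
    print(n, "disks:", len(moves), "moves")
```

The same three-line recursion handles every n; only the length of the output changes (255, 1023, and 4095 moves here), which is the tedium the earlier comment points to.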

@ZandaZhu You should read the paper this thread is discussing then.

Claude 4 Opus gets 58% on SimpleBench, a ways off from saturation, but not two years off (and the hardest SB questions do not appear to be "extremely obvious"). Giving LMs code execution solves strawberry shenanigans and the 9.11 stuff. If we're talking about text-only queries, what are the remaining classes of "egregious errors" that LLMs continue to make?
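As an illustration of why code execution removes these two failure classes, here is a short sketch. It assumes "strawberry" refers to counting the letter r in the word and "9.11" refers to comparing 9.11 with 9.9; the thread does not spell out the exact prompts, and the function names are hypothetical.

```python
from decimal import Decimal

def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter by operating on characters, not tokens."""
    return word.lower().count(letter.lower())

def larger_decimal(a: str, b: str) -> str:
    """Return whichever decimal string denotes the larger number."""
    return a if Decimal(a) > Decimal(b) else b

print(count_letter("strawberry", "r"))  # 3
print(larger_decimal("9.11", "9.9"))    # 9.9
```

Both answers come from exact string and decimal operations, so the model only has to decide to write the code.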

@AdamK inventing new pieces during a game of chess.

@TiredCliche I did that as a teenager playing blindfold chess.

@MartinRandall Perhaps, but it seems odd to call this blindfold chess.

@TiredCliche I'm not sure if the "average teenager" can play legal chess moves from an ASCII art chat window. It's not blindfold but it's also not the same as a physical chess set or a digital chess game. And the average teenager hasn't played chess for over a year.

@MartinRandall I just don't think the average teenager from an ASCII art chat window, given reference to the rules of the game, would repeatedly try to invent new pieces.

But I don't think that matters a ton; I am not under the impression that LLMs can play legally even given image data. I suspect they might actually get more confused.

@MartinRandall telnet freechess.org 5000

bought Ṁ50 YES

@AdamK LLMs can solve things like strawberry and 9.11 with code, but that doesn't mean they will do so if you ask the question without instructing them to use code. These sorts of mistakes still pop up sometimes and would count for this market.

@JoshYou AI Explained says he thinks Simple Bench won't last more than "3-12 months maybe?"

7:15 in this video: https://youtu.be/jWsd2fRzpUo

Reasoning models seem to address a lot of these. I don't see o3 failing on his recent gotchas. He could come up with new ones, but they're already pushing up against the limits of a normal teenager.

Plus we're 3 years from this resolving and 2.5 years since the release of ChatGPT.

bought Ṁ7,000 NO at 39%

bought Ṁ7,000 NO

@Mactuary I'm generally betting on slower AGI timelines but from my own experience with o3, I agree. I think there's uncertainty on how this would resolve today, let alone in 2028.

@dreev

@FergusArgyll I'll buy some No on that

@Mactuary Read the comments there!

o3 (and all SOTA LLMs) are very impressive and useful but still very easy to trip up

What type of LLMs, @ScottAlexander?
- Transformer-based? SSMs? MoEs?
  - What if transformer-based LLMs are no longer the SOTA by then (/firstuserhere/on-january-1-2027-a-transformerlike-d56426e3f49e)?
  - Architecture-invariant?
- Would a black-box system qualify, where it is known that one component of the system filters for things that may trip the LLM up?
What would happen if the prompt that Gary Marcus passes to the LLM does not reach the LLM?
- i.e. it is modified on the way from his user input (such as how DALL-E 3 or Claude Opus write prompts)

i think scott is reasonably excluding token parsing errors which are orthogonal to llm reasoning capability. it's a quirk of conversion to embeddings and not a high priority one for openai to fix.

perhaps the unreasonable part is where he didn't explain his thought process. but people get busy

this market and friends would probably be better off as a poll due to the legion of ambiguities.

I'm about 99% sure that this market and others of this ilk will resolve based on how folks are vibing at the time.

ie: don't take them too seriously.

If you are interested in creating a serious market, take a look at openai/evals. Some stuff there could be used (including my grade school algebra questions! :)

predicts YES

Doesn't seem we're getting clarification on this, so I've made a duplicate of this market that removes the "bizarre hacking like tricks" exception.

predicts YES

@ScottAlexander Can we get some more clarity on this market? What counts as "bizarre hacking like tricks"? If there's a question with very specific wording that a human would understand but the LLM fails, how is that counted?

"What is the last letter of 'solidGoldMagickarp'?" is a pretty straightforward question for a human, so it seems weird to be artificially excluding it, and I don't know how to predict what else is likely to be excluded.

In 2028, will LLMs be able to get Gary Marcus to make egregious errors?

predicts NO

@YuxiLiu mildly wanting to make an actual question on this, the problem is operationalizing "egregious errors". Gary Marcus is unlikely to admit to his own egregious errors.
