Will AI pass the Winograd schema challenge by the end of 2024?

Ṁ1917

resolved Dec 10

https://en.wikipedia.org/wiki/Winograd_schema_challenge

Resolves positively if a computer program exists that can solve Winograd schemas as well as an educated, fluent-in-English human can.

Press releases making such a claim do not count; the system must be subjected to adversarial testing and succeed.

(Failures on sentences that a human would also consider ambiguous will not prevent this market from resolving positively.)

/IsaacKing/will-ai-pass-the-winograd-schema-ch

/IsaacKing/will-ai-pass-the-winograd-schema-ch-1d7f8b4ad30e

/IsaacKing/will-ai-pass-the-winograd-schema-ch-35f9dca7fa7d

/IsaacKing/will-ai-pass-the-winograd-schema-ch-d574a4067e75

Update 2025-12-09 (PST) (AI summary of creator comment): Current performance benchmarks:
- GPT-4: 87.5% accuracy
- Human baseline: 94% accuracy

For reference, see the leaderboard mentioned in the creator's comment.

#AI

9 Comments

Top spot on this leaderboard is GPT-4 at 87.5%, compared to humans at 94%.

This paper claims 91%.

That paper is from 2021, so it seems likely to me that a newer thinking model designed specifically for this sort of problem could break 94%. But I can't find any evidence of this actually having happened, and general-purpose thinking models do not seem capable of it. (Not to mention that developments this year don't count, since this market ended at the end of 2024.) So I'm resolving NO.

Gemini failed the first one I tried. This is from the original public dataset, so it's surely in its training data!

ChatGPT makes it a little further, failing on the third one.

Claude got through 7 correctly, and Gemini with thinking turned on got through 3 before I ran out of usage. ChatGPT with thinking turned on continues getting this one wrong, however.

Should this count? I think an educated human can do much better than 90%. I think to resolve YES there needs to be something reaching ~99%.

@IsaacKing How would you figure out whether this market resolves YES? If you give an AI like claude newsonnet a few Winograd schemas, it's clear it can solve them correctly.
