Will AI pass the Winograd schema challenge by the end of 2023?
Resolved NO (Apr 16)

https://en.wikipedia.org/wiki/Winograd_schema_challenge

Resolves positively if a computer program exists that can solve Winograd schemas as well as an educated human who is fluent in English.

Press releases making such a claim do not count; the system must be subjected to adversarial testing and succeed.

(Failures on sentences that a human would also consider ambiguous will not prevent this market from resolving positively.)

/IsaacKing/will-ai-pass-the-winograd-schema-ch

/IsaacKing/will-ai-pass-the-winograd-schema-ch-1d7f8b4ad30e

/IsaacKing/will-ai-pass-the-winograd-schema-ch-35f9dca7fa7d

/IsaacKing/will-ai-pass-the-winograd-schema-ch-d574a4067e75


Apparently 90% accuracy was reached in 2019.

https://www.sciencedirect.com/science/article/abs/pii/S0004370223001170

A human should be able to do much better than 90% though, so I'm inclined to still resolve this NO.

I just tested GPT-4 on the original benchmark and it could not even get 90%, despite having been trained on at least some of them.

predicted NO

GPT-4 is closer! 87.5% now, up from 81.6% with GPT-3.5.

I believe SmartGPT plus prompt engineering can theoretically do it. Whether it is proven equal to a fluent human in 2023 is a different matter.

predicted NO

Some interesting discussion here.

What do you mean by adversarial testing? The Winograd schema challenge is a defined benchmark; are you asking about something different?

predicted NO

@vluzko I just mean that I want to be sure that it can actually pass. Also, if its training data includes the existing Winograd sentences, then I'd want to give it different ones.

@IsaacKing but what do you mean by making sure? E.g., are you sure GPT-4 passed the benchmarks that OpenAI said it did? And given the popularity of Winograd, could you really exclude the benchmark from training? Do you mean you want to have enough access to run your own version?

predicted NO

@JacyAnthis No, if OpenAI provides a description of an experiment with enough detail that it seems this should resolve YES, I'll believe them unless someone provides good evidence I shouldn't.