Will mechanistic interpretability be essentially solved for GPT-3 before 2030?
Closes 2030 · 14% chance

Mechanistic interpretability aims to reverse engineer neural networks in a way that is analogous to reverse engineering a compiled binary computer program. Achieving this level of interpretability for a neural network like GPT-3 would involve creating a binary computer program that is interpretable by expert human programmers and can emulate the input-output behavior of GPT-3 with high accuracy.

Before January 1st, 2030, will mechanistic interpretability be essentially solved for GPT-3, resulting in a binary computer program that is interpretable by ordinary expert human programmers and emulates GPT-3's input-output behavior up to a high level of accuracy?

Resolution Criteria:

This question will resolve positively if, before January 1st, 2030, a binary computer program is developed that meets the following criteria:

  1. Interpretability: The binary computer program must be interpretable by ordinary expert human programmers, which means:
    a. The program can be read, understood, and modified by programmers who are proficient in the programming language it is written in and who have expertise in computer science and machine learning.
    b. The program is well-documented, with clear explanations of its components, algorithms, and functions.
    c. The program's structure and organization adhere to established software engineering principles, enabling efficient navigation and comprehension by expert programmers.

  2. Accuracy: The binary computer program must emulate GPT-3's input-output behavior with high accuracy, demonstrated by achieving an average word error rate of at most 1.0% relative to the original GPT-3 model when both are given identical inputs with the temperature parameter set to 0. The accuracy must be demonstrated by sampling a large number of inputs from some diverse, human-understandable distribution of text inputs (see the verification sketch after this list).

  3. Not fake: I will use my personal judgement to determine whether a candidate solution seems fake or not. A fake solution is anything that satisfies these criteria without getting at the spirit of the question. I'm trying to understand whether we will reverse engineer GPT-3 in the complete sense, not just whether someone will create a program that technically passes these criteria.
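
To make criterion 2 concrete, here is a minimal sketch of how the word-error-rate check could be run. Everything here is illustrative: `gpt3_complete` and `emulator_complete` are hypothetical stand-ins for deterministic (temperature 0) calls to the original model and to the reverse-engineered program, and the input sample is assumed to be drawn from the diverse distribution described above.

```python
# Hypothetical verification harness for criterion 2.
# `gpt3_complete` and `emulator_complete` are assumed to be deterministic
# (temperature 0) callables mapping an input string to an output string.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

def passes_accuracy_criterion(inputs, gpt3_complete, emulator_complete,
                              threshold: float = 0.01) -> bool:
    """Average WER over the sampled inputs must not exceed 1.0%."""
    rates = [word_error_rate(gpt3_complete(x), emulator_complete(x))
             for x in inputs]
    return sum(rates) / len(rates) <= threshold
```

The 1.0% bound enters only as the `threshold` default; the choice of sampling distribution and sample size is left to the verifier's judgment, as stated in the criteria above.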

This question will resolve negatively if, by January 1st, 2030, no binary computer program meeting the interpretability and accuracy criteria has been developed and verified according to the above requirements. If there is ambiguity or debate about whether a particular program meets the resolution criteria, I will use my discretion to determine the appropriate resolution.

JeanStanislasDenain bought Ṁ100 YES

I think there's a decent chance of transformative AI (TAI) before 2030, in which case AGIs could help us here. That said, this does seem like a really hard challenge, even for an AGI.

Before I bet, this was only one percentage point lower than the analogous question for GPT-2, which seemed wild to me.

predicts NO

@JeanStanislasDenain Hmmm, maybe not completely wild. In any case, I think 25% was way too high.

I'm at >95% that this is literally impossible for human programmers. It seems like it would be totally crazy for this to be possible.