By 2028 will we be able to identify distinct submodules/algorithms within LLMs?

Ṁ1620 · 2028 · 76% chance

Roughly: will we be able to examine an LLM and extract some identifiable submodule accomplishing an understandable task (e.g. "addition", "inference on some decision tree", or "quicksort")? For instance, it could be some set of neurons from layers L_1, ..., L_k that, when run on its own, executes the specified algorithm.

It must also be demonstrated that the LLM actually uses the submodule in some interpretable way. For example, if the module implements quicksort, a demonstration might be that modifying the module to implement reversed quicksort causes the LLM to produce reverse-sorted data when asked for sorted data.
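To make the two requirements concrete (the module runs on its own, and the model demonstrably depends on it), here is a minimal toy sketch. It is not an actual interpretability method: `toy_model`, the `modules` dictionary, and `intervene` are all hypothetical stand-ins, and a real LLM would not expose a clean "sorter" component like this. It only illustrates the shape of the evidence the criteria ask for.

```python
from typing import Callable, Dict, List

Module = Callable[[List[int]], List[int]]

def toy_model(xs: List[int], modules: Dict[str, Module]) -> List[int]:
    """Hypothetical stand-in for an LLM: embed -> sort -> unembed."""
    h = modules["embed"](xs)
    h = modules["sorter"](h)
    return modules["unembed"](h)

# Assumed decomposition of the toy model into named components.
modules: Dict[str, Module] = {
    "embed": lambda xs: [x * 2 for x in xs],      # order-preserving encoding
    "sorter": sorted,                             # the candidate submodule
    "unembed": lambda hs: [h // 2 for h in hs],   # inverse of the encoding
}

def intervene(mods: Dict[str, Module], name: str, replacement: Module) -> Dict[str, Module]:
    """Return a copy of the model's components with one module swapped out."""
    patched = dict(mods)
    patched[name] = replacement
    return patched

prompt = [3, 1, 2]

# Requirement 1: the extracted module executes the algorithm when run alone.
standalone = modules["sorter"]([6, 2, 4])

# Requirement 2: swapping in a reversed sorter flips the model's behaviour.
baseline = toy_model(prompt, modules)
patched = toy_model(
    prompt, intervene(modules, "sorter", lambda h: sorted(h, reverse=True))
)

print(standalone)  # [2, 4, 6]
print(baseline)    # [1, 2, 3]
print(patched)     # [3, 2, 1]
```

In a real model the hard part is the step this sketch assumes away: finding a decomposition in which one component can be swapped without disturbing everything else.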

The work must be done for an LLM at least as capable as OPT-66B.

The work must identify at least 10 submodules, or identify at least one while proving that no others exist.

If it turns out that the question is ill-posed in a way that can't be fixed with some minor tweaks, I'll resolve N/A.

Up until 2026 I may refine the criteria here, either in response to feedback from predictors or future research giving me a better way to ask the question.

#AI

#Technical AI Timelines

#AI Safety

#Technical AI Safety

#Mechanistic interpretability

3 Comments

How well-defined do the submodules need to be?

I am sure that it's possible to find subnetworks that are activated more for certain types of tasks, but I don't expect these to be cleanly demarcated. I expect there to be a lot of nodes and edges that partially contribute: if you exclude all of these partial contributions, the network can't do the task, but if you include all of them, you're including most of the network.
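The tradeoff described here can be sketched numerically. This is a toy model under loud assumptions: the "network" is just 100 units whose contribution to the task falls off as 1/(i+1), and a "circuit" is every unit whose contribution is at least some threshold. The weights, thresholds, and the function name `circuit_tradeoff` are all made up for illustration and are not measurements from any real model.

```python
from typing import List, Tuple

def circuit_tradeoff(
    weights: List[float], thresholds: List[float]
) -> List[Tuple[float, float, float]]:
    """For each threshold, return (threshold, fraction of units kept,
    fraction of total task contribution recovered by the kept units)."""
    total = sum(abs(w) for w in weights)
    rows = []
    for t in thresholds:
        kept = [w for w in weights if abs(w) >= t]
        rows.append((t, len(kept) / len(weights), sum(abs(w) for w in kept) / total))
    return rows

# Assumed heavy-tailed contributions: a few strong units, many weak ones.
weights = [1.0 / (i + 1) for i in range(100)]

# Lower thresholds admit more of the partial contributors.
rows = circuit_tradeoff(weights, [0.3, 0.07, 0.015, 0.0])

for threshold, size, recovered in rows:
    print(f"threshold={threshold:<6} circuit size={size:.0%} recovered={recovered:.0%}")
```

With contributions this diffuse, a small circuit recovers only part of the behaviour, and recovering nearly all of it means keeping most of the units, which is the "tangled" outcome the comment is worried about.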

@jonsimon Most likely that would resolve NO, but it would depend on exactly how tangled up everything is.

@vluzko If it doesn't work in the neuron basis, but does work in a learned basis discovered by e.g. using sparse autoencoders, does that count?
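A small sketch of why the basis matters, under stated assumptions: this does not train a sparse autoencoder. It simply assumes two feature directions `d_a` and `d_b` that a learned dictionary might recover, neither of which lines up with a single neuron, and compares ablating a neuron with ablating a direction. All names and numbers are hypothetical.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]

# Assumed unit-norm, mutually orthogonal feature directions, rotated 30 degrees
# away from the neuron axes (stand-ins for learned dictionary features).
d_a: Vec = (math.cos(math.pi / 6), math.sin(math.pi / 6))
d_b: Vec = (-math.sin(math.pi / 6), math.cos(math.pi / 6))

def dot(u: Vec, v: Vec) -> float:
    return sum(a * b for a, b in zip(u, v))

def activation(a: float, b: float) -> Vec:
    """Two-neuron activation carrying feature A with strength a and B with strength b."""
    return tuple(a * x + b * y for x, y in zip(d_a, d_b))

def read_features(h: Vec) -> Tuple[float, float]:
    """Read off how strongly each feature is present in the activation."""
    return dot(h, d_a), dot(h, d_b)

def ablate_neuron(h: Vec, i: int) -> Vec:
    """Zero out neuron i (an intervention in the neuron basis)."""
    return tuple(0.0 if j == i else v for j, v in enumerate(h))

def ablate_direction(h: Vec, d: Vec) -> Vec:
    """Project out direction d (an intervention in the feature basis)."""
    c = dot(h, d)
    return tuple(v - c * x for v, x in zip(h, d))

h = activation(1.0, 1.0)  # both features active with strength 1

neuron_ablation = read_features(ablate_neuron(h, 0))
direction_ablation = read_features(ablate_direction(h, d_a))

print("neuron 0 ablated:     ", neuron_ablation)     # both features disturbed
print("direction d_a ablated:", direction_ablation)  # A removed, B intact
```

Zeroing a neuron damages both features at once, while projecting out `d_a` removes feature A and leaves feature B untouched, so a module that looks hopelessly entangled in the neuron basis can still be cleanly separable in another one.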

Also, how complex does the "module" have to be? Would a "module" that does date math (e.g. "today is January 19th, in exactly 2 weeks it will be") count?
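For reference, the date-math task in this example is small enough to pin down exactly, which is what makes it a plausible target: a candidate module's outputs could be checked against a ground-truth implementation like the sketch below. The year 2024 is an assumption (the comment gives none), and `two_weeks_later` is a hypothetical helper name.

```python
from datetime import date, timedelta

def two_weeks_later(today: date) -> date:
    """Ground-truth answer for 'in exactly 2 weeks it will be ...'."""
    return today + timedelta(weeks=2)

# Year chosen arbitrarily for illustration; January always has 31 days,
# so the month and day of the result do not depend on it.
answer = two_weeks_later(date(2024, 1, 19))
print(answer.isoformat())  # 2024-02-02
```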

Related questions
