Will superposition in transformers be mostly solved by 2026?
83 · Ṁ17k · 2026 · 73% chance

Superposition is a hypothesized mechanism for polysemanticity. It is a major bottleneck for interpretability. There are groups working on reducing it, most notably Chris Olah's group at Anthropic. However, it is possible that reducing superposition is hard, or that superposition is not an accurate model of polysemanticity.
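
For intuition, here is a minimal numpy sketch of the toy picture behind the superposition hypothesis: more sparse features than dimensions, each assigned a nearly orthogonal direction, so individual neurons end up responding to several features. The sizes and sparsity below are illustrative assumptions, not anything this market defines.

```python
# Toy sketch of superposition (illustrative only): embed more sparse features
# than there are dimensions, along nearly orthogonal random directions.
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 256, 64          # more features than dimensions

# Random unit-norm columns are nearly (but not exactly) orthogonal.
W = rng.normal(size=(d_model, n_features))
W /= np.linalg.norm(W, axis=0, keepdims=True)

# A sparse feature vector: only a handful of features active at once.
f = np.zeros(n_features)
active = rng.choice(n_features, size=4, replace=False)
f[active] = 1.0

x = W @ f                              # activations with features in superposition
f_hat = W.T @ x                        # naive linear readout of every feature

print("readout at the active features:", np.round(f_hat[active], 2))
print("largest readout anywhere else: ", np.round(np.delete(f_hat, active).max(), 2))
```

Because activity is sparse, the active features usually stand out in the readout, but every entry picks up interference from the non-orthogonal directions; that interference is the polysemanticity an interpretability method has to untangle.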

The following would qualify for a YES resolution:

  • A modified transformer architecture that, when trained, has at most 50% of the superposition of an iso-performance regular transformer

  • A method for reading out features in superposition from a regular or modified transformer that can recover at least 50% of the features in superposition

The following would qualify for a (pre-2026) NO resolution:

  • Only a small fraction of features can be recovered (<50%)

  • Superposition is shown conclusively to be an invalid model of polysemanticity

In the event that it is unclear how many features are actually in superposition (there could hypothetically be an absurd number of near-orthogonal directions), preliminary (and not necessarily conclusive) evidence that the remaining candidate directions are not relevant is sufficient to rule them out of consideration. One way a recovered-feature fraction might be scored is sketched below.
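
As a non-authoritative illustration of the "50% of features" criterion, here is one way a recovered-feature fraction could be scored in a toy setting where the ground-truth directions are known. The cosine threshold and matching rule are illustrative assumptions, not part of the resolution criteria.

```python
# Hedged sketch: score what fraction of known ground-truth feature directions
# a method recovered, by best cosine match against the method's learned directions.
import numpy as np

def fraction_recovered(true_dirs, learned_dirs, cos_threshold=0.9):
    """true_dirs: (n_true, d); learned_dirs: (n_learned, d); rows are unit vectors."""
    sims = np.abs(true_dirs @ learned_dirs.T)   # (n_true, n_learned) cosine similarities
    best = sims.max(axis=1)                     # best match for each true feature
    return float((best >= cos_threshold).mean())

rng = np.random.default_rng(0)
d, n_true = 32, 100
true_dirs = rng.normal(size=(n_true, d))
true_dirs /= np.linalg.norm(true_dirs, axis=1, keepdims=True)

# Pretend a method recovered slightly noisy copies of 60 of the 100 features.
learned = true_dirs[:60] + 0.05 * rng.normal(size=(60, d))
learned /= np.linalg.norm(learned, axis=1, keepdims=True)

print(fraction_recovered(true_dirs, learned))   # ~0.6, which would clear the 50% bar here
```

In a real transformer there is no ground-truth feature list, which is exactly the difficulty the paragraph above is gesturing at.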


ok

🤨🤨🤨🤨🤨🤨🤨🤨🤨🤨🤨

As a clarification: the method must also demonstrably meet the 50% criterion for transformers of nontrivial size (GPT-2 as a lower bound), and it should appear plausible that it will scale to frontier transformers (for example, a scaling law demonstrating continued improvement would satisfy this condition). So a one-layer transformer alone will not qualify. I think this is the most natural interpretation of the title: "superposition in transformers" implies transformers in some degree of generality.

@LeoGao Also, additional clarification: >50% variance explained by an autoencoder will not qualify for the >50% of features requirement
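
A minimal sketch of why that distinction matters, with made-up shapes: variance explained is a reconstruction metric, and a reconstruction that keeps only a few high-variance directions can clear 50% variance explained while recovering almost none of the directions present.

```python
# Hedged illustration: high "variance explained" does not imply many features recovered.
import numpy as np

def variance_explained(x, x_hat):
    """x, x_hat: (n_samples, d_model). Returns 1 - fraction of variance unexplained."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 64))
x[:, 0] *= 10.0                         # one direction carries most of the variance

x_hat = np.zeros_like(x)                # a "reconstruction" keeping only that direction
x_hat[:, 0] = x[:, 0]

print(variance_explained(x, x_hat))     # ~0.6 while ignoring 63 of the 64 directions
```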

I feel like I am missing something important here

Why... is this spiking? Is it because of Chris Olah's comments about being excited about interp, and about it being, in his opinion, more of an engineering problem?

predicts YES

Is this for any transformer? How does this resolve if we have an expensive technique that has been validated on small transformers but hasn't been successfully applied to very large transformers?

Cases I'm interested in:
- It satisfies the market criteria for at least one small transformer and it's reasonable to think the best technique in 2026 would work on large transformers if we had really good hardware we currently don't have
- It satisfies the criteria with a small transformer and it's reasonable to think it would work on large transformers, but it would be expensive and no one's tried it yet.
- It satisfies the criteria with a small transformer and preliminary results for larger transformers are mixed or don't satisfy the market's criteria.

where small is something between ~8M-1B parameters

Noa Nabeshima bought Ṁ200 YES

@NoaNabeshima elaborate the excitement?

predicts YES

@BartholomewHughes I didn't think carefully about the actual probability; I think I'm not trying to be a very good predictor on this market, fwiw. I've been doing some superposition stuff with some promising early results and attending to public stuff. My main story for this resolving YES is that Anthropic succeeds. I think trading against me isn't unreasonable. Part of what's going on here for me is just enjoying the feeling of being bullish and (?) the incentive to do a good job (seems silly, but that's what it's actually like for me).

predicts YES

@BartholomewHughes also 3.5 years is a long time

50% of features in what sized model?

Interesting question! It honestly wouldn't surprise me if SoLU has at most 50% of the superposition of a normal model, though it's really hard to quantify. My guess is that removing superposition is impossible, but that being able to recover many features is doable-ish, though 50% is a high bar. My best guess for how this breaks is just that we never figure out how to quantify the number of features.
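
A small sketch of why quantifying the number of features is hard (sizes arbitrary): in high dimensions even random unit vectors are close to orthogonal, so counting candidate near-orthogonal directions does not by itself pin down how many features a model represents.

```python
# Hedged sketch: random directions in high dimensions are already nearly orthogonal.
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 2000                        # dimensions, number of random directions
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T
off_diag = np.abs(cos[~np.eye(n, dtype=bool)])
print("largest |cosine| between any pair:", round(off_diag.max(), 3))   # roughly 0.2-0.25
```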

What if a better model than superposition is discovered, but superposition kinda sorta still fits with some contortions and tweaks?

predicts NO

@NoaNabeshima If it explains more than half the features or variance or something then I'd resolve yes