Will mechanistic interpretability have more academic impact than representation engineering by the end of 2025? | Manifold

Mini

7

Ṁ143

Jan 2

72% chance


Measured by number of new citations, will a paper on mechanistic interpretability have more academic impact than a paper on representation engineering by the end of 2025?

I'd expect a mech interp paper to use methodology from, or originating in, Transformer Circuits and related work, and/or to use relatively low-level units of analysis (e.g., units at least as small as small groups of attention heads) to explain an algorithm the model implements. Causal intervention work is included here as well.

For representation engineering, I'd expect a top-down approach along the lines of Burns et al. 2022, Turner et al. 2023, the RepE paper, or https://arxiv.org/abs/2206.10999. Such a paper would probably use similar unsupervised methodology. I would probably also include parts of previous NLP work here, e.g. https://arxiv.org/abs/2004.07667 or https://arxiv.org/abs/2309.07311, and parts of the model similarity literature.

#Mechanistic interpretability


Related questions

Will interpretability be commonplace in physics papers relying on machine learning by the end of 2025?

Will mechanistic interpretability be essentially solved for the human brain before 2040?

Will mechanistic interpretability be essentially solved for GPT-3 before 2030?

Breakthrough in symbolic regression by the end of 2025?

Will a model costing >$30M be intentionally trained to be more mechanistically interpretable by end of 2027? (see desc)

By 2035, will mechanistic interpretability enable Nobel Prize-winning work?

Will mechanistic interpretability be essentially solved for GPT-4 before 2030?

Will mechanistic interpretability be essentially solved for GPT-2 before 2030?

Will ASCII diagrams be replaced by interactive visualizations in MCP by EOY 2025?

Will "How To Become A Mechanistic Interpretability ..." make the top fifty posts in LessWrong's 2025 Annual Review?
