In 2025, what 2019-2022 work of AI safety will I think was most significant?
11 traders · Ṁ328 · closes Jan 1

18% — Other
6% — Risks from Learned Optimization in Advanced Machine Learning Systems (https://arxiv.org/abs/1906.01820)
6% — Constitutional AI: Harmlessness from AI Feedback (https://www.anthropic.com/constitutional.pdf)
1.8% — Other Not Listed Here
1.3% — Optimal Policies Tend to Seek Power (https://arxiv.org/abs/1912.01683)

Works to be considered include arXiv papers first appearing in this time window, LessWrong posts, and paper-like posts (mainly to include Anthropic papers). The window is inclusive of both 2019 and 2022. 'Significant' here means having contributed the most to progress towards AI alignment and AI safety. This is obviously very subjective.

If I were to answer this question for papers from 2016-2019, possible answers would have included, among others, 'AI safety via debate' and 'The off-switch game'.

None of those.

@Lauro Is this intended to include the heuristic arguments work?

@JacobPfau For the purposes of this question I'll include the associated arXiv paper under the "Mechanistic Anomaly Detection" option.

@JacobPfau ah yeah I agree it makes sense to include heuristic arguments here!