This is a regularly updated list of resources for getting up to speed on the main ideas in AI alignment (ensuring advanced AI systems do what we want them to do). The list is selective rather than comprehensive. If you think an important resource is missing, please let me know via this form. See the Alignment newsletter for summaries of many resources on this list as well as more extensive references.

[Last updated in Jan 2024. Thanks to Beth Barnes and Neel Nanda for feedback and suggestions.]

Motivation

For a popular audience
For a technical audience
Threat models

Overviews

Strategy and governance
Technical overviews
Research agendas
Research motivations
Books

Technical work

Alignment components
- Outer alignment
- Inner alignment
Alignment enablers
Other technical work

Projects

Requests for proposals
Project ideas

Careers

Career advice
Programs and courses

Communities

Motivation

For a popular audience

Ajeya Cotra and Kelsey Piper. Planned Obsolescence blog.

Katja Grace (2023). AI is not an arms race. How AI development is different from a classic arms race and why moving too fast can have catastrophic outcomes for humanity.

Will Knight (2023). What Really Made Geoffrey Hinton Into an AI Doomer. Interview with Geoffrey Hinton on his concerns about humanity losing control of AI.

Holden Karnofsky (2022). AI Could Defeat All Of Us Combined and Why Would AI “Aim” To Defeat Humanity? The first post argues that advanced AI would be the first technology with the capability to overthrow humanity, and that this alone is a cause for concern (independently of how likely it is that AI systems would want to do so). The second post explains why advanced AI might end up directed towards that goal.

Kelsey Piper (2018). The case for taking AI seriously as a threat to humanity.

Tim Urban (2015). Wait But Why: The AI Revolution. An accessible introduction to AI risk forecasts and arguments (with cute hand-drawn diagrams, and a few corrections from Luke Muehlhauser).

Open Philanthropy (2015). Potential risks from advanced artificial intelligence. An overview of AI risks and timelines, possible interventions, and current actors in this space.

Future of Life Institute (2015). Benefits and Risks of Artificial Intelligence.

For a technical audience

Robert Miles. AI Safety YouTube channel. Introductory videos on challenges in AI safety.

Yoshua Bengio (2023). How rogue AIs may arise. Presents a set of definitions, hypotheses and resulting claims about AI systems which could harm humanity and discusses the possible conditions under which such catastrophes could arise.

Daniel Eth (2023). The Need For Work On Technical AI Alignment. Introductory explainer on the alignment problem and the need for AI alignment research.

Katja Grace (2022). Counterarguments to the basic AI x-risk case.

Ajeya Cotra (2021). Why AI alignment could be hard with modern deep learning.

Richard Ngo (2020). AGI safety from first principles.

Stuart Russell (2019). Provably Beneficial AI. A video of Russell’s talk introducing the AI alignment problem and the goal of building provably beneficial AI.

Scott Alexander (2015). No time like the present for AI safety work. An overview of long-term AI safety challenges, e.g. preventing wireheading and formalizing ethics.

Stuart Russell (2014). On Myths and Moonshine. Russell’s response to the “Myth of AI” question on Edge.org, which draws an analogy between AI research and nuclear research, and points out some dangers of optimizing a misspecified utility function.

Stuart Armstrong (2014). Smarter Than Us: The Rise Of Machine Intelligence. A short ebook discussing potential promises and challenges presented by advanced AI, and the interdisciplinary problems that need to be solved on the way there.

Steve Omohundro (2007). The basic AI drives. A classic paper arguing that sufficiently advanced AI systems are likely to develop drives such as self-preservation and resource acquisition independently of their assigned objectives.

Threat models

Ngo, Chan, Mindermann (2022). The alignment problem from a deep learning perspective.

Ajeya Cotra (2022). Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover.

Nate Soares (2022). A central AI alignment problem: capabilities generalization, and the sharp left turn.

Joe Carlsmith (2022). Is power-seeking AI an existential risk?

Neel Nanda (2021). My Overview of the AI Alignment Landscape: Threat Models.

Paul Christiano (2019). What failure looks like.

Overviews

Strategy and governance

Ho, Barnhart, et al (2023). International Institutions for Advanced AI.

Anderljung, Barnhart, et al (2023). Frontier AI Regulation: Managing Emerging Risks to Public Safety.

Chan, Salganik, et al (2023). Harms from Increasingly Agentic Algorithmic Systems.

Brundage, Avin, et al (2020). Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims.

Brundage, Avin, et al (2018). The Malicious Use of AI: Forecasting, Prevention and Mitigation.

Grace, Salvatier, et al (2017). When Will AI Exceed Human Performance? Evidence from AI Experts. A detailed survey of >300 authors at NeurIPS and ICML conferences, on when AI is likely to outperform humans in various domains.

Nick Bostrom (Global Policy, 2017). Strategic Implications of Openness in AI Development.

Technical overviews

Victoria Krakovna (2022). Paradigms of AI alignment: components and enablers.

Neel Nanda (2021). My Overview of the AI Alignment Landscape: A Bird’s Eye View. Overview of threat models, agendas for building safe AGI, and robustly good approaches to alignment.

Evan Hubinger (2020). An overview of 11 proposals for building safe advanced AI. Compares different variations on iterated amplification, debate, and recursive reward modeling, and evaluates them on the criteria of outer alignment, inner alignment, training competitiveness, and performance competitiveness.

Paul Christiano (2019). AI alignment landscape (EA Global).

Manheim and Garrabrant (2018). Categorizing Variants of Goodhart’s Law. Explores different failure modes of overoptimizing metrics at the expense of the real objective.

Ortega, Maini, et al (2018). Building safe artificial intelligence: specification, robustness, and assurance.

Everitt, Lea, Hutter (2018). AGI safety literature review.

FLI (2015). A survey of research priorities for robust and beneficial AI.

Jacob Steinhardt (2015). Long-Term and Short-Term Challenges to Ensuring the Safety of AI Systems. A taxonomy of AI safety issues that require ordinary vs extraordinary engineering to address.

Nate Soares (2015). Safety engineering, target selection, and alignment theory. Identifies and motivates three major areas of AI safety research.

Research agendas

Kees and Janus (2023). Cyborgism. Proposes a strategy for safely accelerating alignment research, by setting up human-in-the-loop systems which empower human agency rather than outsource it, and using those systems to differentially accelerate progress on alignment.

Beth Barnes (2023). ARC evals. Evaluating the capabilities and alignment of large language models.

Christiano, Cotra, Xu (2021). Eliciting Latent Knowledge. The Alignment Research Center’s approach to training a model to report its latent knowledge.

John Wentworth (2021). Selection Theorems: A Program For Understanding Agents.

Hendrycks, Carlini, et al (2021). Unsolved problems in ML safety. Presents open problems in four research areas: robustness, monitoring, alignment and external safety.

Ajeya Cotra (2021). The case for aligning narrowly superhuman models.

Steve Byrnes (2021). Research agenda update. Brain-like AGI safety: how to solve alignment if AGI is built by deducing and implementing the algorithms used by the brain.

Dafoe, Hughes, et al (2020). Open problems in cooperative AI.

Abram Demski (2020). Learning Normativity: A Research Agenda.

Gruetzemacher, Dorner, et al (2020). Forecasting AI Progress: A Research Agenda.

Olah, Cammarata, et al (2020). Zoom In: An Introduction to Circuits. An interpretability research agenda focused on studying the connections between neurons to find meaningful algorithms in the weights of neural networks.

Critch and Krueger (2020). AI Research Considerations for Human Existential Safety (ARCHES). Research agenda on how to steer technical AI research to avoid catastrophic outcomes for humanity.

Jesse Clifton (2019). Preface to CLR’s Research Agenda on Cooperation, Conflict, and TAI.

Stuart Armstrong (2019). Synthesising a human’s preferences into a utility function. A research agenda focused on identifying partial human preferences (contrasting two situations along a single variable) and incorporating them into a utility function.

Scott Garrabrant (2018). Embedded Agents.

Leike, Krueger, et al (2018). Scalable agent alignment via reward modeling: a research direction. (blog post)

Vanessa Kosoy (2018). The Learning-Theoretic AI Alignment Research Agenda.

Allan Dafoe (2018). AI Governance: A Research Agenda.

Garrabrant and Demski (2018). Embedded agency. How can we align agents embedded in a world that is bigger than themselves? This agenda discusses and motivates the embedded agency problem and its subproblems: decision theory, embedded world models, robust delegation, and subsystem alignment.

Soares and Fallenstein (2017). Aligning Superintelligence with Human Interests: A Technical Research Agenda.

Amodei, Olah, et al (2016). Concrete Problems in AI Safety. Research agenda focusing on accident risks that apply to current ML systems as well as more advanced future AI systems.

Taylor, Yudkowsky, et al (2016). Alignment for Advanced Machine Learning Systems.

Research motivations

Neel Nanda (2022). A Longlist of Theories of Impact for Interpretability.

John Wentworth (2021). The plan. Reasoning behind the plan to sort out our fundamental confusions about agency and do ambitious value learning based on a better understanding of agency.

Evan Hubinger (2021). How do we become confident in the safety of a machine learning system?

Paul Christiano (2020). My research methodology.

Evan Hubinger (2019). Chris Olah’s views on AGI safety.

Eliezer Yudkowsky (2018). The rocket alignment problem.

Books

Brian Christian (2020). The Alignment Problem: Machine Learning and Human Values.

Toby Ord (2020). The Precipice: Existential Risk and the Future of Humanity.

Stuart Russell (2019). Human Compatible: AI and the Problem of Control.

Eric Drexler (2019). Reframing Superintelligence: Comprehensive AI Services as General Intelligence.

Max Tegmark (2017). Life 3.0: Being Human in the Age of AI.

Nick Bostrom (2014). Superintelligence: Paths, Dangers, Strategies.

Technical work

Alignment components

Outer alignment

Scalable oversight

Burns, Izmailov, et al (2023). Weak-to-strong generalization: eliciting strong capabilities with weak supervision. (blog post)

Michael, Mahdi, et al (2023). Debate Helps Supervise Unreliable Experts.

Brown-Cohen, Irving, Piliouras (2023). Scalable AI Safety via Doubly-Efficient Debate.

Beth Barnes (2021). Imitative Generalisation (AKA ‘Learning the Prior’). An approach to teaching machines to imitate how humans generalize, which addresses some of the shortcomings of iterated amplification.

Barnes and Christiano (2020). Progress on AI safety via Debate.

Christiano, Shlegeris, Amodei (2018). Supervising strong learners by amplifying weak experts. Proposes Iterated Amplification, which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems.

Irving, Christiano, Amodei (2018). AI safety via debate. Agents are trained to debate topics and point out flaws in one another’s arguments, and a human decides who wins. This could allow agents to learn to perform tasks at a superhuman level while remaining aligned with human preferences. (blog post)

Reward learning

Casper, Davies, et al (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.

Stiennon, Ouyang, et al (NeurIPS 2020). Learning to summarize from human feedback.

Armstrong, Leike, et al (2020). Pitfalls of learning a reward function online. Introduces desirable properties for a learning process that prevent it from being manipulated by the agent: unriggability and uninfluenceability.

Jeon, Milli, Dragan (2020). Reward-rational (implicit) choice: A unifying formalism for reward learning.

Reddy, Dragan, et al (2019). Learning Human Objectives by Evaluating Hypothetical Behavior. Introduces ReQueST (reward query synthesis via trajectory optimization), an algorithm for safely learning a reward model by querying the user about hypothetical behaviors. (blog post)

Ibarz, Leike, et al (NeurIPS 2018). Reward learning from human preferences and demonstrations in Atari.

Christiano, Leike, et al (NeurIPS 2017). Deep reinforcement learning from human preferences. Communicating complex goals to AI systems using human feedback (comparing pairs of agent trajectory segments).

Inverse reinforcement learning

Shah, Gundotra, et al (ICML 2019). On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference.

Hadfield-Menell, Milli, et al (NeurIPS 2017). Inverse Reward Design. Formalizes the problem of inferring the true objective intended by the human based on the designed reward, and proposes an approach that helps to avoid side effects and reward hacking.

Fisac, Gates, et al (ISRR 2017). Pragmatic-Pedagogic Value Alignment. A cognitive science approach to the cooperative inverse reinforcement learning problem.

Milli, Hadfield-Menell, et al (IJCAI 2017). Should robots be obedient? Obedience to humans may sound like a great thing, but blind obedience can get in the way of learning human preferences.

Hadfield-Menell, Dragan, et al (NeurIPS 2016). Cooperative inverse reinforcement learning. Defines value learning as a cooperative game where the human tries to teach the agent about their reward function, rather than giving optimal demonstrations like in standard IRL.

Evans, Stuhlmüller, Goodman (AAAI 2016). Learning the Preferences of Ignorant, Inconsistent Agents. Relaxing some assumptions on human rationality when inferring preferences.

Inner alignment

Shah, Varma, et al (2022). Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals.

Langosco, Koch, et al (ICML 2022). Goal Misgeneralization in Deep Reinforcement Learning. Introduces and provides examples of goal misgeneralization failures, which occur when an RL agent retains its capabilities out-of-distribution but competently pursues the wrong goal.

Evan Hubinger (2021). Clarifying inner alignment terminology.

Hubinger, van Merwijk, et al (2019). Risks from Learned Optimization in Advanced Machine Learning Systems. Introduces the concept of mesa-optimization, which occurs when a learned model is itself an optimizer that may be pursuing a different objective than the base objective it was trained on.

Alignment enablers

Mechanistic interpretability

Bricken, Templeton, et al (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.

Lindner, Kramar, et al (2023). Tracr: Compiled Transformers as a Laboratory for Interpretability.

Bills, Cammarata, et al (2023). Language models can explain neurons in language models.

Nanda, Chan, et al (ICLR 2023). Progress measures for grokking via mechanistic interpretability.

Burns, Ye, et al (2022). Discovering Latent Knowledge in Language Models Without Supervision. Introduces Contrast-Consistent Search, a method for discovering model beliefs.

Meng, Bau, et al (NeurIPS 2022). Locating and Editing Factual Associations in GPT.

Elhage, Hume, et al (2022). Toy models of superposition.

Chan, Garriga-Alonso, et al (2022). Causal scrubbing.

Elhage, Nanda, et al (2021). A Mathematical Framework for Transformer Circuits.

Goh, Cammarata, et al (Distill 2021). Multimodal Neurons in Artificial Neural Networks.

Cammarata, Carter, et al (Distill 2020). Thread: Circuits.

Olah, Satyanarayan, et al (Distill 2018). The Building Blocks of Interpretability.

Olah, Mordvintsev, Schubert (Distill 2017). Feature visualization.

Model evaluations

ARC Evals (2023). Evaluating Language-Model Agents on Realistic Autonomous Tasks.

Shevlane, Farquhar, et al (2023). Model evaluation for extreme risks. Proposes a framework to evaluate general-purpose models for catastrophic risks.

Pan, Chan, et al (ICML 2023). Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.

Perez, Ringer, et al (2022). Discovering Language Model Behaviors with Model-Written Evaluations.

Perez, Huang, et al (2022). Red Teaming Language Models with Language Models.

Forecasting

Tom Davidson (2023). What a compute-centric framework says about takeoff speeds.

Matthew Barnett (2023). A compute-based framework for thinking about the future of AI.

Ajeya Cotra (2020). Forecasting Transformative AI with Biological Anchors. (blog post, AN summary)

Kaplan, McCandlish, et al (2020). Scaling laws for neural language models.

Hernandez and Brown (2020). AI and Efficiency. An analysis showing that the amount of compute needed to train a neural net to the same performance on ImageNet classification has been decreasing by a factor of 2 every 16 months.

Amodei and Hernandez (2018). AI and Compute. An analysis showing that the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time.
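To make the two rates quoted in the last two entries easier to compare, here is a minimal arithmetic sketch. The function name `growth_factor` is hypothetical and simply converts a doubling (or halving) time into a multiplicative factor over a given period. It assumes a constant exponential trend, which is a simplification of the analyses themselves.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative change over `months`, assuming a constant doubling time.

    Hypothetical helper for illustration only; it assumes a smooth
    exponential trend, which the cited analyses only approximate.
    """
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


# Compute used in the largest training runs: 3.4-month doubling time.
compute_per_year = growth_factor(12, 3.4)

# Compute needed for fixed ImageNet performance: halves every 16 months,
# so efficiency improves by this factor per year.
efficiency_per_year = growth_factor(12, 16)

print(f"compute growth per year: {compute_per_year:.2f}x")
print(f"efficiency gain per year: {efficiency_per_year:.2f}x")
```

Under these assumptions, the compute trend works out to roughly an 11.5x increase per year, while the efficiency trend corresponds to roughly a 1.7x improvement per year.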

Understanding incentives

Causal analysis of incentives

Farquhar, Carey, Everitt (2022). Path-Specific Objectives for Safer Agent Incentives.

Langlois and Everitt (2021). How RL Agents Behave When Their Actions Are Modified.

Carey, Langlois, et al (2020). The Incentives that Shape Behaviour. Introduces control and response incentives, which are feasible for the agent to act on. (blog post)

Everitt, Kumar, et al (2019). Modeling AGI Safety Frameworks with Causal Influence Diagrams. Suggests diagrams for modeling different AGI safety frameworks, including reward learning, counterfactual oracles, and safety via debate.

Everitt, Ortega, et al (2019). Understanding Agent Incentives using Causal Influence Diagrams. Introduces observation and intervention incentives.

Impact measures and side effects

Krakovna, Orseau, et al (NeurIPS 2020). Avoiding Side Effects By Considering Future Tasks. Proposes an auxiliary reward for possible future tasks that provides an incentive to avoid side effects, and formalizes interference incentives.

Turner, Ratzlaff, Tadepalli (NeurIPS 2020). Avoiding Side Effects in Complex Environments. A scalable implementation of Attainable Utility Preservation that avoids side effects in the SafeLife environment.

Saisubramanian, Kamar, Zilberstein (IJCAI 2020). A Multi-Objective Approach to Mitigate Negative Side Effects.

Rahaman, Wolf, et al (ICLR 2020). Learning the Arrow of Time. Proposes a potential-based reachability measure that is sensitive to the magnitude of irreversible effects.

Shah, Krasheninnikov, et al (ICLR 2019). Preferences Implicit in the State of the World. Since the initial state of the environment is often optimized for human preferences, information from the initial state can be used to infer both side effects that should be avoided and preferences for how the environment should be organized. (blog post, code)

Turner, Hadfield-Menell, Tadepalli (AIES 2020). Conservative Agency via Attainable Utility Preservation. Introduces Attainable Utility Preservation, a general impact measure that penalizes change in the utility attainable by the agent compared to the stepwise inaction baseline, and avoids introducing certain bad incentives. (code)

Krakovna, Orseau, et al (2019). Penalizing side effects using stepwise relative reachability (version 2). A general impact measure that penalizes reducing reachability of states compared to the stepwise inaction baseline, and avoids introducing certain bad incentives. (version 2 blog post, version 1 blog post)

Eysenbach, Gu, et al (ICLR 2018). Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning. Reducing the number of irreversible actions taken by an RL agent by learning a reset policy that tries to return the environment to the initial state.

Armstrong and Levinstein (2017). Low Impact Artificial Intelligences. An intractable but enlightening definition of low impact for AI systems.

Power-seeking and shutdown

Joe Carlsmith (2023). Scheming AIs: Will AIs fake alignment during training in order to get power?

Turner and Tadepalli (NeurIPS 2022). Parametrically Retargetable Decision-Makers Tend To Seek Power.

Ryan Carey (AIES 2018). Incorrigibility in the CIRL Framework. Investigates the corrigibility of value learning agents under model misspecification.

El Mhamdi, Guerraoui, et al (NeurIPS 2017). Dynamic Safe Interruptibility for Decentralized Multi-Agent Reinforcement Learning.

Hadfield-Menell, Dragan, et al (AIES 2017). The Off-Switch Game. This paper studies the interruptibility problem as a game between human and robot, and investigates which incentives the robot could have to allow itself to be switched off.

Orseau and Armstrong (UAI 2016). Safely interruptible agents. Provides a formal definition of safe interruptibility and shows that off-policy RL agents are more interruptible than on-policy agents. (blog post)

Soares, Fallenstein, et al (2015). Corrigibility. Designing AI systems without incentives to resist corrective modifications by their creators.

Specification gaming

Victoria Krakovna. Master list of specification gaming examples. Includes examples from various existing sources. Suggest new examples through this form. (blog post)

Gao, Schulman, Hilton (2022). Scaling Laws for Reward Model Overoptimization.

Pan, Bhatia, Steinhardt (NeurIPS 2021). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models.

Krueger, Maharaj, Leike (2020). Hidden incentives for auto-induced distributional shift. Introduces the auto-induced distributional shift (ADS) problem, where the agent cheats by modifying the task distribution.

Lehman, Clune, et al (2018). The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities.

Everitt, Krakovna, et al (IJCAI 2017). Reinforcement learning with a corrupted reward channel. A formalization of the reward hacking problem in terms of true and corrupt reward, a proof that RL agents cannot overcome reward corruption, and a framework for giving the agent extra information to overcome reward corruption. (blog post, demo, code)

Amodei and Clark (2016). Faulty Reward Functions in the Wild. An example of reward function gaming in a boat racing game, where the agent gets a higher score by going in circles and hitting the same targets than by actually playing the game.

Jessica Taylor (AAAI Workshop 2016). Quantilizers: A Safer Alternative to Maximizers for Limited Optimization. Addresses objective specification problems by sampling an action from the top quantile of high-utility actions instead of always taking the optimal action.
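The quantilizer idea in the last entry can be illustrated with a short sketch. This is not the paper's formal construction: the function name `quantilize` and its parameters are hypothetical, and the code assumes the candidate actions are equally weighted samples from some base distribution, so that picking uniformly among the top `q` fraction approximates sampling from the top quantile.

```python
import random


def quantilize(actions, utility, q, rng=random):
    """Return a random action from the top q fraction of `actions` by utility.

    Illustrative sketch only. Assumes `actions` is a non-empty list of
    equally weighted samples from a base distribution (an assumption made
    here for simplicity), and that 0 < q <= 1.
    """
    if not actions:
        raise ValueError("actions must be non-empty")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    # Rank candidates from highest to lowest utility.
    ranked = sorted(actions, key=utility, reverse=True)
    # Keep at least one candidate so the choice is always defined.
    k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:k])


# Usage: eight candidate actions scored by a toy utility (the action itself).
candidates = list(range(8))
picked = quantilize(candidates, utility=lambda a: a, q=0.25,
                    rng=random.Random(0))
print(picked)  # one of the two highest-utility candidates
```

A maximizer would always return the single highest-scoring action. The sketch instead spreads its choice over the top quarter of candidates, which limits how hard any one flaw in the utility function gets exploited.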

Tampering and wireheading

Everitt and Hutter (2019). Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective. Classifies reward tampering problems into reward function tampering, feedback tampering, and reward function input tampering, and compares different solutions to those problems. (blog post)

Everitt and Hutter (2016). Avoiding Wireheading with Value Reinforcement Learning. An alternative to RL that reduces the incentive to wirehead.

Laurent Orseau (2015). Mortal universal agents and wireheading. An investigation into how different types of artificial agents respond to opportunities to wirehead (unintended shortcuts to maximize their objective function).

Foundations

John Wentworth (2021). Testing The Natural Abstraction Hypothesis: Project Intro.

Scott Garrabrant (2021). Finite factored sets. An alternative framework to Pearlian causal inference, with applications to embedded agency.

Alex Flint (2020). The ground of optimization.

Yudkowsky and Soares (2017). Functional Decision Theory: A New Theory of Instrumental Rationality. New decision theory that avoids the pitfalls of causal and evidential decision theories. (blog post)

Garrabrant, Benson-Tilsen, et al (2016). Logical Induction. A computable algorithm for the logical induction problem.

Andrew Critch (2016). Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents. FairBot cooperates in the Prisoner’s Dilemma if and only if it can prove that the opponent cooperates. Surprising result: FairBots cooperate with each other. (blog post: Open-source game theory is weird)

Yudkowsky and Herreshoff (2013). Tiling Agents for Self-Modifying AI, and the Löbian Obstacle.

Other technical work

Containment

Cohen, Vellambi, Hutter (2019). Asymptotically Unambitious Artificial General Intelligence. Introduces Boxed Myopic Artificial Intelligence (BoMAI), an RL algorithm that remains indifferent to gaining power in the outside world as its capabilities increase.

Babcock, Kramar, Yampolskiy (2017). Guidelines for Artificial Intelligence Containment.

Eliezer Yudkowsky (2002). The AI Box Experiment.

Environments

Rohin Shah (2021). BASALT: A Benchmark for Learning from Human Feedback. A set of Minecraft environments and a human evaluation protocol for solving tasks with no pre-specified reward function.

Wainwright and Eckersley (2019). SafeLife 1.0: Exploring Side Effects in Complex Environments. Presents a collection of environment levels testing for side effects in a Game of Life setting. (blog post)

Michaël Trazzi (2018). A Gym Gridworld Environment for the Treacherous Turn. (code)

Leech, Kubicki, et al (2018). Preventing Side-effects in Gridworlds. A collection of gridworld environments to test for side effects. (code)

Leike, Martic, et al (2017). AI Safety Gridworlds. A collection of simple environments to illustrate and test for different AI safety problems in reinforcement learning agents. (blog post, code)

Stuart Armstrong (2015). A toy model of the control problem. Agent obstructs the supervisor’s camera to avoid getting turned off.

Collections of technical works

Rohin Shah. Alignment newsletter. A weekly newsletter summarizing recent work relevant to AI safety.

Paul Christiano. AI Alignment. A blog on designing safe, efficient AI systems (approval-directed agents, aligned reinforcement learning agents, etc).

CHAI bibliography

MIRI publications

FHI publications

Projects

Requests for proposals

Beckstead and Bergal (2021). Request for proposals for projects in AI alignment that work with deep learning systems.

Ajeya Cotra (2021). Techniques for enhancing human feedback.

Chris Olah (2021). Interpretability. Calls for research building towards the ability to “reverse engineer” trained neural networks into human-understandable algorithms, enabling auditors to catch unanticipated safety problems in these models.

Owain Evans (2021). Truthful and honest AI.

Jacob Steinhardt (2021). Measuring and forecasting risks.

Project ideas

Neel Nanda (2022). 200 Concrete Open Problems in Mechanistic Interpretability.

Evan Hubinger (2022). This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point.

Richard Ngo (2022). Some conceptual alignment research projects.

Jacob Hilton (2022). Procedurally evaluating factual accuracy: a request for research.

Nate Soares (2021). Visible Thoughts Project and Bounty Announcement.

Paul Christiano (2021). Why I’m excited about Redwood Research’s current project (to train a model that completes short snippets of fiction without outputting text where someone gets injured).

Evan Hubinger (2021). Automating Auditing: An ambitious concrete technical research proposal.

Paul Christiano (2021). How much chess engine progress is about adapting to bigger computers?

Anderljung and Carlier (2021). Some AI governance research ideas.

Paul Christiano (2021). Experimentally evaluating whether honesty generalizes.

Owain Evans (2021). AI Safety Research Project Ideas.

Ajeya Cotra (2021). The case for aligning narrowly superhuman models – Potential near-future projects: “sandwiching”.

Evan Hubinger (2019). Concrete experiments in inner alignment and Towards an empirical investigation of inner alignment.