AI safety resources

A regularly updated list of resources for getting up to speed on the main ideas in long-term AI safety. The resources are selected for relevance and/or brevity, and the list is not meant to be comprehensive.

Note that I did not include literature on well-studied safety problems like safe exploration, distributional shift, adversarial examples, or interpretability (see e.g. Concrete Problems or the CHAI bibliography for extensive references on these topics).



For a general audience

FLI. AI risk background and FAQ. At the bottom of the background page, there is a more extensive list of resources on AI safety.

Robert Miles. Computerphile videos on AI safety. Introductory videos on AI safety concepts and motivation.

Kelsey Piper (2018). The case for taking AI seriously as a threat to humanity.

Sutskever and Amodei (2017). Wall Street Journal: Protecting Against AI’s Existential Threat.

Cade Metz (2017). New York Times: Teaching A.I. Systems to Behave Themselves.

Tim Urban (2015). Wait But Why: The AI Revolution. An accessible introduction to AI risk forecasts and arguments (with cute hand-drawn diagrams, and a few corrections from Luke Muehlhauser).

OpenPhil (2015). Potential risks from advanced artificial intelligence. An overview of AI risks and timelines, possible interventions, and current actors in this space.

For a technical audience

Richard Ngo (2019). Disentangling arguments for the importance of AI safety.

Robert Miles. AI Safety YouTube channel. Introductory videos on technical challenges in AI safety.

Stuart Russell:

  • The long-term future of AI (longer version) (2015). A video of Russell’s classic talk, discussing why it makes sense for AI researchers to think about AI safety, and going over various misconceptions about the issues.
  • Concerns of an AI pioneer (2015). An interview with Russell on the importance of provably aligning AI with human values, and the challenges of value alignment research.
  • On Myths and Moonshine (2014). Russell’s response to the “Myth of AI” question on Edge.org, which draws an analogy between AI research and nuclear research, and points out some dangers of optimizing a misspecified utility function.

Scott Alexander (2015). No time like the present for AI safety work. An overview of long-term AI safety challenges, e.g. preventing wireheading and formalizing ethics.

Victoria Krakovna (2015). AI risk without an intelligence explosion. An overview of long-term AI risks besides the (overemphasized) intelligence explosion / hard takeoff scenario, arguing why intelligence explosion skeptics should still think about AI safety.

Stuart Armstrong (2014). Smarter Than Us: The Rise Of Machine Intelligence. A short ebook discussing potential promises and challenges presented by advanced AI, and the interdisciplinary problems that need to be solved on the way there.

Steve Omohundro (2007). The basic AI drives. A classic paper arguing that sufficiently advanced AI systems are likely to develop drives such as self-preservation and resource acquisition independently of their assigned objectives.


Strategy overviews

Brundage, Avin, Clark, Toner, Eckersley, et al (2018). The Malicious Use of AI: Forecasting, Prevention and Mitigation.

Grace, Salvatier, Dafoe, Zhang, Evans (2017). When Will AI Exceed Human Performance? Evidence from AI Experts. A detailed survey of >300 authors at NIPS and ICML conferences, on when AI is likely to outperform humans in various domains.

Nick Bostrom (Global Policy, 2017). Strategic Implications of Openness in AI Development.

Technical overviews

Manheim and Garrabrant (2018). Categorizing Variants of Goodhart’s Law. Explores different failure modes of overoptimizing metrics at the expense of the real objective.

Paul Christiano. Podcast with 80000 Hours. Discusses how OpenAI is developing real solutions to the ‘AI alignment problem’, and Paul’s vision of how humanity will progressively hand over decision-making to AI systems.

Ortega, Maini, et al (2018). Building safe artificial intelligence: specification, robustness, and assurance.

Everitt, Lea, Hutter (2018). AGI safety literature review.

FLI (2015). A survey of research priorities for robust and beneficial AI.

Jacob Steinhardt (2015). Long-Term and Short-Term Challenges to Ensuring the Safety of AI Systems. A taxonomy of AI safety issues that require ordinary vs extraordinary engineering to address.

Nate Soares (2015). Safety engineering, target selection, and alignment theory. Identifies and motivates three major areas of AI safety research.

Research agendas

Stuart Armstrong (2019). Synthesising a human’s preferences into a utility function. A research agenda focused on identifying partial human preferences (contrasting two situations along a single variable) and incorporating them into a utility function.

Leike, Krueger, Everitt, Martic, Maini, Legg (2018). Scalable agent alignment via reward modeling: a research direction. (blog post)

Allan Dafoe (2018). AI Governance: A Research Agenda.

Garrabrant and Demski (2018). Embedded agency. How can we align agents embedded in a world that is bigger than themselves? This post discusses and motivates the embedded agency problem and its subproblems: decision theory, embedded world models, robust delegation, and subsystem alignment.

Soares and Fallenstein (2017). Aligning Superintelligence with Human Interests: A Technical Research Agenda.

Amodei, Olah, Steinhardt, Christiano, Schulman, Mané (2016). Concrete Problems in AI Safety. Research agenda focusing on accident risks that apply to current ML systems as well as more advanced future AI systems.

Taylor, Yudkowsky, LaVictoire, Critch (2016). Alignment for Advanced Machine Learning Systems.


Books

Stuart Russell (2019). Human Compatible: AI and the Problem of Control.

Eric Drexler (2019). Reframing Superintelligence: Comprehensive AI Services as General Intelligence.

Max Tegmark (2017). Life 3.0: Being Human in the Age of AI.

Nick Bostrom (2014). Superintelligence: Paths, Dangers, Strategies.

Technical work

Agent foundations

Hubinger, van Merwijk, Mikulik, Skalse, Garrabrant (2019). Risks from Learned Optimization in Advanced Machine Learning Systems. Introduces the concept of mesa-optimization, which occurs when a learned model is itself an optimizer that may be pursuing a different objective than the one it was trained on.

Yudkowsky and Soares (2017). Functional Decision Theory: A New Theory of Instrumental Rationality. New decision theory that avoids the pitfalls of causal and evidential decision theories. (blog post)

Garrabrant, Benson-Tilsen, Critch, Soares, Taylor (2016). Logical Induction. A computable algorithm for the logical induction problem.

Andrew Critch (2016). Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents. FairBot cooperates in the Prisoner’s Dilemma if and only if it can prove that the opponent cooperates. Surprising result: FairBots cooperate with each other. (blog post: Open-source game theory is weird)

Yudkowsky and Herreshoff (2013). Tiling Agents for Self-Modifying AI, and the Löbian Obstacle.

Causal analysis of incentives

Carey, Langlois, Everitt, Legg (2020). The Incentives that Shape Behaviour. Introduces control and response incentives, which are feasible for the agent to act on.

Everitt, Kumar, Krakovna, Legg (2019). Modeling AGI Safety Frameworks with Causal Influence Diagrams. Suggests diagrams for modeling different AGI safety frameworks, including reward learning, counterfactual oracles, and safety via debate.

Everitt, Ortega, Barnes, Legg (2019). Understanding Agent Incentives using Causal Influence Diagrams. Introduces observation and intervention incentives.


Containment

Cohen, Vellambi, Hutter (2019). Asymptotically Unambitious Artificial General Intelligence. Introduces Boxed Myopic Artificial Intelligence (BoMAI), an RL algorithm that remains indifferent to gaining power in the outside world as its capabilities increase.

Babcock, Kramar, Yampolskiy (2017). Guidelines for Artificial Intelligence Containment.

Eliezer Yudkowsky (2002). The AI Box Experiment.


Environments

Wainwright and Eckersley (2019). SafeLife 1.0: Exploring Side Effects in Complex Environments. Presents a collection of environment levels testing for side effects in a Game of Life setting. (blog post)

Leike, Martic, Krakovna, Ortega, Everitt, Lefrancq, Orseau, Legg (2017). AI Safety Gridworlds. A collection of simple environments to illustrate and test for different AI safety problems in reinforcement learning agents. (blog post, code)

Stuart Armstrong. A toy model of the control problem. Agent obstructs the supervisor’s camera to avoid getting turned off.

Michaël Trazzi. A Gym Gridworld Environment for the Treacherous Turn. (code)

Leech, Kubicki, Cooper, McGrath. Preventing Side-effects in Gridworlds. A collection of gridworld environments to test for side effects. (code)

Impact measures and side effects

Rahaman, Wolf, Goyal, Remme, Bengio (ICLR 2020). Learning the Arrow of Time. Proposes a potential-based reachability measure that is sensitive to magnitude of irreversible effects.

Shah, Krasheninnikov, Alexander, Abbeel, Dragan (ICLR 2019). Preferences Implicit in the State of the World. Since the initial state of the environment is often optimized for human preferences, information from the initial state can be used to infer both side effects that should be avoided and preferences for how the environment should be organized. (blog post, code)

Turner, Hadfield-Menell, Tadepalli (AIES 2020). Conservative Agency via Attainable Utility Preservation. Introduces Attainable Utility Preservation, a general impact measure that penalizes change in the utility attainable by the agent compared to the stepwise inaction baseline, and avoids introducing certain bad incentives. (code)
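
As a rough illustration of the idea in the entry above, here is a minimal sketch of an attainable-utility-style penalty. It is a simplification and not the paper's implementation: it assumes the Q-values of a set of auxiliary reward functions are already available for the chosen action and for a no-op action in the same state, and it leaves out any rescaling of the penalty. The names attainable_utility_penalty, penalized_reward, and penalty_weight are hypothetical.

```python
def attainable_utility_penalty(q_action, q_noop):
    """Mean absolute change in attainable utility across auxiliary rewards.

    q_action[i] and q_noop[i] are assumed Q-values of auxiliary reward i
    after taking the chosen action vs. doing nothing in the same state
    (the stepwise inaction baseline).
    """
    if not q_action or len(q_action) != len(q_noop):
        raise ValueError("need two non-empty Q-value lists of equal length")
    total_change = sum(abs(a - n) for a, n in zip(q_action, q_noop))
    return total_change / len(q_action)


def penalized_reward(task_reward, q_action, q_noop, penalty_weight=1.0):
    """Task reward minus a weighted attainable-utility penalty."""
    penalty = attainable_utility_penalty(q_action, q_noop)
    return task_reward - penalty_weight * penalty


# An action that leaves every auxiliary Q-value unchanged is not penalized.
print(penalized_reward(1.0, [1.0, 2.0], [1.0, 2.0]))
# An action that shifts attainable utility loses part of its reward.
print(penalized_reward(1.0, [3.0, 0.0], [1.0, 0.0], penalty_weight=0.5))
```

Taking the absolute value means that gaining the ability to achieve auxiliary goals is penalized as much as losing it.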

Krakovna, Orseau, Kumar, Martic, Legg (2019). Penalizing side effects using stepwise relative reachability (version 2). A general impact measure that penalizes reducing reachability of states compared to the stepwise inaction baseline, and avoids introducing certain bad incentives. (version 2 blog post, version 1 blog post)

Zhang, Durfee, Singh (IJCAI 2018). Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes. Whitelisting the features of the state that the agent is permitted to change, and allowing the agent to query the human about changing a small number of features outside the whitelist.

Eysenbach, Gu, Ibarz, Levine (ICLR 2017). Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning. Reducing the number of irreversible actions taken by an RL agent by learning a reset policy that tries to return the environment to the initial state.

Armstrong and Levinstein (2017). Low Impact Artificial Intelligences. An intractable but enlightening definition of low impact for AI systems.

Interruptibility and corrigibility

Alexander Turner (2019). Optimal Farsighted Agents Tend to Seek Power. Formalizes the concept of instrumental convergence in an MDP setting. (blog post)

Ryan Carey (AIES 2018). Incorrigibility in the CIRL Framework. Investigates the corrigibility of value learning agents under model misspecification.

El Mhamdi, Guerraoui, Hendrikx, Maurer (NIPS 2017). Dynamic Safe Interruptibility for Decentralized Multi-Agent Reinforcement Learning.

Hadfield-Menell, Dragan, Abbeel, Russell (AIES 2017). The Off-Switch Game. This paper studies the interruptibility problem as a game between human and robot, and investigates which incentives the robot could have to allow itself to be switched off.

Orseau and Armstrong (UAI 2016). Safely interruptible agents. Provides a formal definition of safe interruptibility and shows that off-policy RL agents are more interruptible than on-policy agents. (blog post)

Soares, Fallenstein, Yudkowsky, Armstrong (2015). Corrigibility. Designing AI systems without incentives to resist corrective modifications by their creators.

Inverse reinforcement learning

Malik, Palaniappan, Fisac, Hadfield-Menell, Russell, Dragan (ICML 2018). An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning. Uses the property of CIRL that the human is a full-information agent to reduce the complexity of the problem by an exponential factor.

Ratner, Hadfield-Menell, Dragan (R:SS 2018). Simplifying Reward Design through Divide-and-Conquer. An approach for inferring a common reward across multiple environments with separately specified rewards.

Fu, Luo, Levine (ICLR 2018). Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. Using an adversarial approach to IRL to recover reward functions that are invariant to changing environment dynamics and distributional shift, by factoring out reward shaping from the learned reward function. This resolves IRL’s reward ambiguity in favor of safer reward functions.

Hadfield-Menell, Milli, Abbeel, Russell, Dragan (NIPS 2017). Inverse Reward Design. Formalizes the problem of inferring the true objective intended by the human based on the designed reward, and proposes an approach that helps to avoid side effects and reward hacking.

Fisac, Gates, Hamrick, Liu, Hadfield-Menell, Palaniappan, Malik, Sastry, Griffiths, Dragan (ISRR 2017). Pragmatic-Pedagogic Value Alignment. A cognitive science approach to the cooperative inverse reinforcement learning problem.

Milli, Hadfield-Menell, Dragan, Russell (IJCAI 2017). Should robots be obedient? Obedience to humans may sound like a great thing, but blind obedience can get in the way of learning human preferences.

Amin, Jiang, Singh (NIPS 2017). Repeated Inverse Reinforcement Learning. Separates the reward function into a task-specific component and an intrinsic component. In a sequence of tasks, the agent learns the intrinsic component while trying to avoid surprising the human.

Armstrong and Leike (2016). Towards Interactive Inverse Reinforcement Learning. The agent gathers information about the reward function through interaction with the environment, while at the same time maximizing this reward function, balancing the incentive to learn with the incentive to bias.

Hadfield-Menell, Dragan, Abbeel, Russell (NIPS 2016). Cooperative inverse reinforcement learning. Defines value learning as a cooperative game where the human tries to teach the agent about their reward function, rather than giving optimal demonstrations like in standard IRL.

Evans, Stuhlmüller, Goodman (AAAI 2016). Learning the Preferences of Ignorant, Inconsistent Agents. Relaxing some assumptions on human rationality when inferring preferences.

Scalable oversight

Reddy, Dragan, Levine, Legg, Leike (2019). Learning Human Objectives by Evaluating Hypothetical Behavior. Introduces ReQueST (reward query synthesis via trajectory optimization), an algorithm for safely learning a reward model by querying the user about hypothetical behaviors. (blog post)

Ibarz, Leike, Pohlen, Irving, Legg, Amodei (NIPS 2018). Reward learning from human preferences and demonstrations in Atari.

Christiano, Shlegeris, Amodei (2018). Supervising strong learners by amplifying weak experts. Proposes Iterated Amplification, which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems.

Irving, Christiano, Amodei (2018). AI safety via debate. Agents are trained to debate topics and point out flaws in one another’s arguments, and a human decides who wins. This could allow agents to learn to perform tasks at a superhuman level while remaining aligned with human preferences. (blog post)

Saunders, Sastry, Stuhlmueller, Evans (AAMAS 2018). Trial without Error: Towards Safe Reinforcement Learning via Human Intervention. (blog post)

Christiano, Leike, Brown, Martic, Legg, Amodei (NIPS 2017). Deep reinforcement learning from human preferences. Communicating complex goals to AI systems using human feedback (comparing pairs of agent trajectory segments).
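
To make the comparison-based feedback in the entry above concrete, here is a minimal sketch of how a reward model can be scored against a single human comparison. The probability that segment A is preferred is modelled as a logistic function of the difference in summed predicted rewards, and the loss is the cross-entropy against the human's label. The function names and the plain-list inputs are assumptions for illustration. A real system would compute these predicted rewards with a neural network and minimize the loss over many comparisons.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_probability(rewards_a, rewards_b):
    """Modelled probability that the human prefers segment A over segment B."""
    diff = sum(rewards_a) - sum(rewards_b)
    # P(A preferred) = sigmoid(diff) = exp(-softplus(-diff))
    return math.exp(-_softplus(-diff))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss; label is 1.0 if A was preferred, 0.0 if B was."""
    diff = sum(rewards_a) - sum(rewards_b)
    # -log sigmoid(diff) = softplus(-diff), and -log(1 - sigmoid(diff)) = softplus(diff)
    return label * _softplus(-diff) + (1.0 - label) * _softplus(diff)


# Predicted per-step rewards for two trajectory segments (made-up numbers).
segment_a = [0.5, 1.0, 0.5]
segment_b = [0.2, 0.1, 0.3]
print(preference_probability(segment_a, segment_b))
print(preference_loss(segment_a, segment_b, label=1.0))
```

The loss is small when the reward model already ranks the segments the way the human did, and large when it ranks them the other way round.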

Abel, Salvatier, Stuhlmüller, Evans (2017). Agent-Agnostic Human-in-the-Loop Reinforcement Learning.

Specification gaming

Master list of specification gaming examples. Includes examples from various existing sources. Suggest new examples through this form. (blog post)

Krueger, Maharaj, Legg, Leike (ICLR Workshop 2019). Misleading meta-objectives and hidden incentives for distributional shift. Introduces the self-induced distributional shift (SIDS) problem, where the agent cheats by modifying the task distribution.

Lehman, Clune, Misevic, et al (2018). The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities.

Everitt, Krakovna, Orseau, Hutter, Legg (IJCAI 2017). Reinforcement learning with a corrupted reward channel. A formalization of the reward hacking problem in terms of true and corrupt reward, a proof that RL agents cannot overcome reward corruption, and a framework for giving the agent extra information to overcome reward corruption. (blog post, demo, code)

Amodei and Clark (2016). Faulty Reward Functions in the Wild. An example of reward function gaming in a boat racing game, where the agent gets a higher score by going in circles and hitting the same targets than by actually playing the game.

Jessica Taylor (AAAI Workshop 2016). Quantilizers: A Safer Alternative to Maximizers for Limited Optimization. Addresses objective specification problems by choosing from the top quantile of high-utility actions instead of the optimal action.
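
The selection rule in the entry above can be sketched in a few lines. This is a minimal illustration only. The paper defines a quantilizer relative to a base distribution over actions, and this sketch assumes the simplest case of a uniform base distribution over a finite list. The names quantilize, utility, and q are hypothetical.

```python
import random


def quantilize(actions, utility, q=0.1, rng=None):
    """Sample uniformly from the top q fraction of actions ranked by utility.

    Assumes a uniform base distribution over a finite list of actions.
    A maximizer would instead always return the single highest-utility action.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not actions:
        raise ValueError("actions must be non-empty")
    rng = rng or random.Random()
    ranked = sorted(actions, key=utility, reverse=True)
    # Keep at least one action so that very small q degrades to maximization.
    top_count = max(1, round(len(ranked) * q))
    return rng.choice(ranked[:top_count])


# Example: 100 actions whose utility is just their index.
actions = list(range(100))
choice = quantilize(actions, utility=lambda a: a, q=0.1, rng=random.Random(0))
print(choice)  # one of the ten highest-utility actions, 90 to 99
```

With q = 1 this reduces to sampling from the base distribution, and as q shrinks it approaches a plain maximizer, so q trades off expected utility against how hard the objective is optimized.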

Tampering and wireheading

Everitt and Hutter (2019). Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective. Classifies reward tampering problems into reward function tampering, feedback tampering, and reward function input tampering, and compares different solutions to those problems. (blog post)

Everitt and Hutter (2016). Avoiding Wireheading with Value Reinforcement Learning. An alternative to RL that reduces the incentive to wirehead.

Laurent Orseau (2015). Mortal universal agents and wireheading. An investigation into how different types of artificial agents respond to opportunities to wirehead (unintended shortcuts to maximize their objective function).

Collections of technical works

Rohin Shah. Alignment newsletter. A weekly newsletter summarizing recent work relevant to AI safety.

CHAI bibliography

Berkeley AGI Safety course (CS294)

MIRI publications

FHI publications

FLI grantee publications (scroll down)

Paul Christiano. AI Alignment. A blog on designing safe, efficient AI systems (approval-directed agents, aligned reinforcement learning agents, etc).


Communities

AI Safety Discussion Facebook group

AI Safety Discussion (Open) Facebook group

Careers in AI Safety Facebook group

AI Safety Camp (Facebook group)

Road to AI Safety Excellence (RAISE) (Facebook group)

80000 Hours AI Safety mailing list

Human-aligned AI summer school

If there are any resources missing from this list that you think are a must-read, please let me know! If you want to go into AI safety research, check out the 80000 Hours guidelines and the AI Safety Syllabus.

(This list of resources was originally published as a blog post.)