Reading list to get up to speed on the main ideas in the field of long-term AI safety. The resources are selected for relevance and/or brevity, and the list is not meant to be comprehensive. [Updated on 3 May 2018.]
For a popular audience:
Sutskever and Amodei, 2017. Wall Street Journal: Protecting Against AI’s Existential Threat
Cade Metz, 2017. New York Times: Teaching A.I. Systems to Behave Themselves.
OpenPhil, 2015. Potential risks from advanced artificial intelligence. An overview of AI risks and timelines, possible interventions, and current actors in this space.
Robert Miles. Computerphile videos on AI safety. Introductory videos on AI safety concepts and motivation.
Max Tegmark, 2017. Life 3.0: Being Human in the Age of AI.
For a more technical audience:
- The long-term future of AI (longer version), 2015. A video of Russell’s classic talk, discussing why it makes sense for AI researchers to think about AI safety, and going over various misconceptions about the issues.
- Concerns of an AI pioneer, 2015. An interview with Russell on the importance of provably aligning AI with human values, and the challenges of value alignment research.
- On Myths and Moonshine, 2014. Russell’s response to the “Myth of AI” question on Edge.org, which draws an analogy between AI research and nuclear research, and points out some dangers of optimizing a misspecified utility function.
Robert Miles. AI Safety YouTube channel. Introductory videos on technical challenges in AI safety.
Scott Alexander, 2015. No time like the present for AI safety work. An overview of long-term AI safety challenges, e.g. preventing wireheading and formalizing ethics.
Grace, Salvatier, Dafoe, Zhang, Evans, 2017. When Will AI Exceed Human Performance? Evidence from AI Experts. A detailed survey of >300 authors at NIPS and ICML conferences, on when AI is likely to outperform humans in various domains.
Victoria Krakovna, 2015. AI risk without an intelligence explosion. An overview of long-term AI risks besides the (overemphasized) intelligence explosion / hard takeoff scenario, arguing why intelligence explosion skeptics should still think about AI safety.
Stuart Armstrong, 2014. Smarter Than Us: The Rise Of Machine Intelligence. A short ebook discussing potential promises and challenges presented by advanced AI, and the interdisciplinary problems that need to be solved on the way there.
Brundage, Avin, Clark, Toner, Eckersley, et al (26 authors). The Malicious Use of AI: Forecasting, Prevention and Mitigation.
Leike, Martic, Krakovna, Ortega, Everitt, Lefrancq, Orseau, Legg, 2017. AI Safety Gridworlds. A collection of simple environments to test long-term AI safety problems for reinforcement learning agents. (blog post, code)
Soares and Fallenstein, 2017. Aligning Superintelligence with Human Interests: A Technical Research Agenda
Amodei, Olah, Steinhardt, Christiano, Schulman, Mané, 2016. Concrete Problems in AI safety. Research agenda focusing on accident risks that apply to current ML systems as well as more advanced future AI systems.
Taylor, Yudkowsky, LaVictoire, Critch, 2016. Alignment for Advanced Machine Learning Systems
Jacob Steinhardt, 2015. Long-Term and Short-Term Challenges to Ensuring the Safety of AI Systems. A taxonomy of AI safety issues that require ordinary vs extraordinary engineering to address.
Nate Soares, 2015. Safety engineering, target selection, and alignment theory. Identifies and motivates three major areas of AI safety research.
Nick Bostrom, 2014. Superintelligence: Paths, Dangers, Strategies. A seminal book outlining long-term AI risk considerations.
Steve Omohundro, 2007. The basic AI drives. A classic paper arguing that sufficiently advanced AI systems are likely to develop drives such as self-preservation and resource acquisition independently of their assigned objectives.
Hadfield-Menell, Milli, Abbeel, Russell, Dragan, 2017. Inverse Reward Design. Formalizes the problem of inferring the true objective intended by the human based on the designed reward, and proposes an approach that helps to avoid side effects and reward hacking.
Fu, Luo, Levine, 2017. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. Using an adversarial approach to IRL to recover reward functions that are invariant to changing environment dynamics and distributional shift, by factoring out reward shaping from the learned reward function. This resolves IRL’s reward ambiguity in favor of safer reward functions.
Fisac, Gates, Hamrick, Liu, Hadfield-Menell, Palaniappan, Malik, Sastry, Griffiths, Dragan, 2017. Pragmatic-Pedagogic Value Alignment. A cognitive science approach to the cooperative inverse reinforcement learning problem.
Milli, Hadfield-Menell, Dragan, Russell, 2017. Should robots be obedient? Obedience to humans may sound like a great thing, but blind obedience can get in the way of learning human preferences.
Saunders, Sastry, Stuhlmueller, Evans, 2017. Trial without Error: Towards Safe Reinforcement Learning via Human Intervention. (blog post)
Amin, Jiang, Singh, 2017. Repeated Inverse Reinforcement Learning. Separates the reward function into a task-specific component and an intrinsic component. In a sequence of task, the agent learns the intrinsic component while trying to avoid surprising the human.
Armstrong and Leike, 2016. Towards Interactive Inverse Reinforcement Learning. The agent gathers information about the reward function through interaction with the environment, while at the same time maximizing this reward function, balancing the incentive to learn with the incentive to bias.
Hadfield-Menell, Dragan, Abbeel, Russell, 2016. Cooperative inverse reinforcement learning. Defines value learning as a cooperative game where the human tries to teach the agent about their reward function, rather than giving optimal demonstrations like in standard IRL.
Evans, Stuhlmüller, Goodman, 2016. Learning the Preferences of Ignorant, Inconsistent Agents.
Irving, Christiano, Amodei, 2018. AI safety via debate. Agents are trained debate topics and point out flaws in one another’s arguments, and a human decides who wins. This could allow agents to learn to perform tasks at a superhuman level while remaining aligned with human preferences. (blog post)
Christiano, Leike, Brown, Martic, Legg, Amodei, 2017. Deep reinforcement learning from human preferences. Communicating complex goals to AI systems using human feedback (comparing pairs of agent trajectory segments).
Abel, Salvatier, Stuhlmüller, Evans. Agent-Agnostic Human-in-the-Loop Reinforcement Learning.
Specification gaming / wireheading:
Manheim and Garrabrant, 2018. Categorizing Variants of Goodhart’s Law. Explores different failure modes of overoptimizing metrics at the expense of the real objective (which often results in specification gaming).
Lehman, Clune, Misevic, et al, 2018. The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities.
Everitt, Krakovna, Orseau, Hutter, Legg, 2017. Reinforcement learning with a corrupted reward channel. A formalization of the reward misspecification problem in terms of true and corrupt reward, a proof that RL agents cannot overcome reward corruption, and a framework for giving the agent extra information to overcome reward corruption. (blog post)
Amodei and Clark, 2016. Faulty Reward Functions in the Wild. An example of reward function gaming in a boat racing game, where the agent gets a higher score by going in circles and hitting the same targets than by actually playing the game.
Everitt and Hutter, 2016. Avoiding Wireheading with Value Reinforcement Learning. An alternative to RL that reduces the incentive to wirehead.
Laurent Orseau, 2015. Mortal universal agents and wireheading. An investigation into how different types of artificial agents respond to opportunities to wirehead (unintended shortcuts to maximize their objective function).
Interruptibility / corrigibility:
Ryan Carey. Incorrigibility in the CIRL Framework. Investigates the corrigibility of value learning agents under model misspecification.
Hadfield-Menell, Dragan, Abbeel, Russell. The Off-Switch Game. This paper studies the interruptibility problem as a game between human and robot, and investigates which incentives the robot could have to allow itself to be switched off.
El Mahdi El Mhamdi, Guerraoui, Hendrikx, Maurer, 2017. Dynamic Safe Interruptibility for Decentralized Multi-Agent Reinforcement Learning.
Orseau and Armstrong, 2016. Safely interruptible agents. Provides a formal definition of safe interruptibility and shows that off-policy RL agents are more interruptible than on-policy agents. (blog post)
Soares, Fallenstein, Yudkowsky, Armstrong, 2015. Corrigibility. Designing AI systems without incentives to resist corrective modifications by their creators.
Eysenbach, Gu, Ibarz, Levine, 2017. Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning. Reducing the number of irreversible actions taken by an RL agent by learning a reset policy that tries to return the environment to the initial state.
Armstrong and Levinstein, 2017. Low Impact Artificial Intelligences. An intractable but enlightening definition of low impact for AI systems.
Babcock, Kramar, Yampolskiy, 2017. Guidelines for Artificial Intelligence Containment.
Garrabrant, Benson-Tilsen, Critch, Soares, Taylor, 2016. Logical Induction. A computable algorithm for the logical induction problem.
Note: I did not include literature on less neglected areas of the field like safe exploration, distributional shift, adversarial examples, or interpretability (see e.g. Concrete Problems or the CHAI bibliography for extensive references on these topics).
Collections of technical works
FLI grantee publications (scroll down)
Paul Christiano. AI control. A blog on designing safe, efficient AI systems (approval-directed agents, aligned reinforcement learning agents, etc).
(This list was originally published as a blog post.)