Category Archives: AI safety

Highlights from the ICLR conference: food, ships, and ML security

It’s been an eventful few days at ICLR in the coastal town of Toulon in Southern France, after a pleasant train ride from London with a stopover in Paris for some sightseeing. There was more food than is usually provided at conferences, and I ended up almost entirely subsisting on tasty appetizers. The parties were memorable this year, including one in a vineyard and one in a naval museum. The overall theme of the conference setting could be summarized as “finger food and ships”.


There were a lot of interesting papers this year, especially on machine learning security, which will be the focus of this post. (Here is a great overview of the topic.)

On the attack side, adversarial perturbations now work in physical form (if you print out the image and then take a picture) and they can also interfere with image segmentation. This has some disturbing implications for fooling vision systems in self-driving cars, such as preventing them from recognizing pedestrians. Adversarial examples are also effective at sabotaging neural network policies in reinforcement learning at test time.
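As a quick refresher on how such perturbations are usually constructed, here is a minimal sketch of the fast gradient sign method (FGSM) in PyTorch. The model, images and labels are placeholders and the attacks in the papers above are more sophisticated, but the basic recipe is a single gradient step that increases the classification loss:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, images, labels, epsilon=0.01):
    """Minimal FGSM sketch: perturb each pixel by +/- epsilon in the direction
    that increases the classification loss, keeping pixel values in [0, 1]."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv_images = images + epsilon * images.grad.sign()
    return adv_images.clamp(0, 1).detach()
```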


In more encouraging news, adversarial examples are not entirely transferable between different models. For targeted examples, which aim to be misclassified as a specific class, the target class is not preserved when transferring to a different model. For example, if an image of a school bus is classified as a crocodile by the original model, it has at most 4% probability of being seen as a crocodile by another model. The paper introduces an ensemble method for developing adversarial examples whose targets do transfer, but this seems to only work well if the ensemble includes a model with a similar architecture to the new model.
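Roughly, the ensemble attack amounts to optimizing the image against the targeted loss summed over several source models at once, in the hope that whatever fools all of them also transfers to an unseen model. A hedged sketch (the models, target label and step sizes are placeholders, not the paper's exact procedure):

```python
import torch
import torch.nn.functional as F

def ensemble_targeted_attack(models, image, target_label, epsilon=0.005, steps=20):
    """Sketch of an ensemble-based targeted attack: take iterative gradient steps
    that push every source model's prediction towards the target class."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        # Sum the targeted cross-entropy losses over the ensemble of source models.
        loss = sum(F.cross_entropy(m(adv), target_label) for m in models)
        loss.backward()
        with torch.no_grad():
            adv = (adv - epsilon * adv.grad.sign()).clamp(0, 1)
    return adv
```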

On the defense side, there were some new methods for detecting adversarial examples. One method augments neural nets with a detector subnetwork, which works quite well and generalizes to new adversaries (if they are similar to or weaker than the adversary used for training). Another approach analyzes adversarial images using PCA, and finds that they are similar to normal images in the first few thousand principal components, but have a lot more variance in later components. Note that the reverse is not the case – adding arbitrary variation in trailing components does not necessarily encourage misclassification.
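A rough sketch of that kind of PCA check, using scikit-learn with placeholder data and an arbitrary cutoff (the paper's actual statistics are more careful):

```python
import numpy as np
from sklearn.decomposition import PCA

def trailing_pca_variance(clean_images, candidate_images, n_leading=1000):
    """Fit PCA on flattened clean images, then measure how much variance each
    candidate image puts into the trailing (low-ranked) principal components.
    Adversarial images tend to show unusually high variance there."""
    flatten = lambda x: np.asarray(x).reshape(len(x), -1)
    pca = PCA().fit(flatten(clean_images))
    projected = pca.transform(flatten(candidate_images))
    # The cutoff is a placeholder; keep it below the number of fitted components.
    cut = min(n_leading, projected.shape[1] - 1)
    return projected[:, cut:].var(axis=1)

# Usage sketch: flag images whose trailing variance is far above the range
# observed on held-out clean images.
```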

There has also been progress in scaling adversarial training to larger models and datasets; this work also found that higher-capacity models are more resistant to adversarial examples than lower-capacity ones. My overall impression is that adversarial attacks are still ahead of adversarial defenses, but the defense side is starting to catch up.
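Adversarial training itself is conceptually simple: mix adversarially perturbed examples into each training batch, as in this rough sketch (reusing the imports and the fgsm_perturb function from the FGSM sketch above; scaling this up is where the engineering effort goes):

```python
def adversarial_training_step(model, optimizer, images, labels, epsilon=0.01):
    """One training step on a half-clean, half-adversarial batch (rough sketch)."""
    adv_images = fgsm_perturb(model, images, labels, epsilon)
    optimizer.zero_grad()
    loss = 0.5 * (F.cross_entropy(model(images), labels)
                  + F.cross_entropy(model(adv_images), labels))
    loss.backward()
    optimizer.step()
    return loss.item()
```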


(Cross-posted to the FLI blog and Approximately Correct. Thanks to Janos Kramar for his feedback on this post.)

2016-17 New Year review

2016 progress

Research / career:

  • Got a job at DeepMind as a research scientist in AI safety.
  • Presented MiniSPN paper at ICLR workshop.
  • Finished RNN interpretability paper and presented at ICML and NIPS workshops.
  • Attended the Deep Learning Summer School.
  • Finished and defended PhD thesis.
  • Moved to London and started working at DeepMind.

FLI:

  • Talk and panel (moderator) at Effective Altruism Global X Boston
  • Talk and panel at the Governance of Emerging Technologies conference at ASU
  • Talk and panel at Brain Bar Budapest
  • AI safety session at OpenAI unconference
  • Talk and panel at Effective Altruism Global X Oxford
  • Talk and panel at Cambridge Catastrophic Risk Conference run by CSER

Rationality / effectiveness:

  • Went to a 5-day Zentensive meditation retreat with Janos, in between grad school and moving to London. This was very helpful for practicing connecting with my direct emotional experience, and a good way to reset during a life transition.
  • Stopped using 42goals (too glitchy) and started recording data in a Google form emailed to myself daily. Now I am actually entering accurate data every day instead of doing it retroactively whenever I remember. I tried a number of goal tracking apps, but all of them seemed too inflexible (I was surprised not to find anything that provides correlation charts between different goals, e.g. meditation vs. hours of sleep).

Random cool things:

  • Hiked in the Andes to an altitude of 17,000 feet.
  • Visited the Grand Canyon.
  • New countries visited: UK, Bolivia, Spain.
  • Started a group house in London (moving there in a few weeks).
  • Started contributing to the new blog Approximately Correct on societal impacts of machine learning.


2016 prediction outcomes

Resolutions:

  1. Finish PhD thesis (70%) – done
  2. Write at least 12 blog posts (40%) – 9
  3. Meditate at least 200 days (50%) – 245
  4. Exercise at least 200 days (50%) – 282
  5. Do at least 5 pullups in a row (40%) – still only 2-3
  6. Record at least 50 new thoughts (50%) – 29
  7. Stay up past 1:30am at most 20% of the nights (40%) – 26.8%
  8. Do at least 10 pomodoros per week on average (50%) – 13

Predictions:

  1. At least one paper accepted for publication (70%) – two papers accepted to workshops
  2. I will get at least one fellowship (40%)
  3. Insomnia at most 20% of nights (20%) – 18.3%
  4. FLI will co-organize at least 3 AI safety workshops (50%) – AAAI, ICML, NIPS

Calibration:

  • Low predictions (20-40%): 1/5 = 20% (overconfident)
  • Medium predictions (50-70%): 6/7 = 85% (underconfident)
  • It’s interesting that my 40% predictions were all wrong, and my 50% predictions were almost all correct. I seem to be translating system 1 labels of ‘not that likely’ and ‘reasonably likely’ to 40% and 50% respectively, while they should translate to something more like 25% and 70%. After the overconfident predictions last year, I tried to tone down the predictions for this year, but the lower ones didn’t get toned down enough.
  • I seem to be more accurate on predictions than resolutions, probably due to wishful thinking. Experimenting with no resolutions for next year.

2017 predictions

  1. Our AI safety team will have at least two papers accepted for publication at a major conference, not counting workshops (70%).
  2. I will write at least 9 blog posts (50%).
  3. I will meditate at least 250 days (45%).
  4. I will exercise at least 250 days (55%).
  5. I will visit at least 2 new countries (80%).
  6. I will attend Burning Man (85%).

AI Safety Highlights from NIPS 2016

This year’s Neural Information Processing Systems conference was larger than ever, with almost 6000 people attending, hosted in a huge convention center in Barcelona, Spain. The conference started off with two exciting announcements on open-sourcing collections of environments for training and testing general AI capabilities – the DeepMind Lab and the OpenAI Universe. Among other things, this is promising for testing safety properties of ML algorithms. OpenAI has already used their Universe environment to give an entertaining and instructive demonstration of reward hacking that illustrates the challenge of designing robust reward functions.

I was happy to see a lot of AI-safety-related content at NIPS this year. The ML and the Law symposium and Interpretable ML for Complex Systems workshop focused on near-term AI safety issues, while the Reliable ML in the Wild workshop also covered long-term problems. Here are some papers relevant to long-term AI safety:

Inverse Reinforcement Learning

Cooperative Inverse Reinforcement Learning (CIRL) by Hadfield-Menell, Russell, Abbeel, and Dragan (main conference). This paper addresses the value alignment problem by teaching the artificial agent about the human’s reward function, using instructive demonstrations rather than optimal demonstrations like in classical IRL (e.g. showing the robot how to make coffee vs having it observe coffee being made). (3-minute video)


Generalizing Skills with Semi-Supervised Reinforcement Learning by Finn, Yu, Fu, Abbeel, and Levine (Deep RL workshop). This work addresses the scalable oversight problem by proposing the first tractable algorithm for semi-supervised RL. This allows artificial agents to robustly learn reward functions from limited human feedback. The algorithm uses an IRL-like approach to infer the reward function, using the agent’s own prior experiences in the supervised setting as an expert demonstration.

Towards Interactive Inverse Reinforcement Learning by Armstrong and Leike (Reliable ML workshop). This paper studies the incentives of an agent that is trying to learn about the reward function while simultaneously maximizing the reward. The authors discuss some ways to reduce the agent’s incentive to manipulate the reward learning process.

Should Robots Have Off Switches? by Milli, Hadfield-Menell, and Russell (Reliable ML workshop). This poster examines some adverse effects of incentivizing artificial agents to be compliant in the off-switch game (a variant of CIRL).

Safe exploration

Safe Exploration in Finite Markov Decision Processes with Gaussian Processes by Turchetta, Berkenkamp, and Krause (main conference). This paper develops a reinforcement learning algorithm called Safe MDP that can explore an unknown environment without getting into irreversible situations, unlike classical RL approaches.

Combating Reinforcement Learning’s Sisyphean Curse with Intrinsic Fear by Lipton, Gao, Li, Chen, and Deng (Reliable ML workshop). This work addresses the ‘Sisyphean curse’ of DQN algorithms: past experiences are forgotten as they become increasingly unlikely under the current policy, so the agent eventually repeats its catastrophic mistakes. The paper introduces an approach called ‘intrinsic fear’, which maintains a model of how likely different states are to lead to a catastrophe within some number of steps.
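A very rough sketch of the idea (not the authors' implementation): train a separate classifier to predict whether a state leads to catastrophe within some horizon, and subtract a scaled version of that probability from the Q-learning target.

```python
import torch

def fear_shaped_target(reward, next_q_max, fear_model, next_state,
                       gamma=0.99, fear_scale=1.0):
    """Sketch of an 'intrinsic fear' Q-learning target: penalize the bootstrapped
    value by the predicted probability that next_state leads to a catastrophe
    within some number of steps. fear_model is a separately trained binary
    classifier (a placeholder here)."""
    with torch.no_grad():
        catastrophe_prob = fear_model(next_state)  # probability in [0, 1]
    return reward + gamma * next_q_max - fear_scale * catastrophe_prob
```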

~~~~~

Most of these papers were related to inverse reinforcement learning – while IRL is a promising approach, it would be great to see more varied safety material at the next NIPS (fingers crossed for some innovative contributions from Rocket AI!). There were some more safety papers on other topics at UAI this summer: Safely Interruptible Agents (formalizing what it means to incentivize an agent to obey shutdown signals) and A Formal Solution to the Grain of Truth Problem (providing a broad theoretical framework for multiple agents learning to predict each other in arbitrary computable games).

(Cross-posted to Approximately Correct and the FLI blog. Thanks to Jan Leike, Zachary Lipton, and Janos Kramar for providing feedback on this post.)

OpenAI unconference on machine learning

Last weekend, I attended OpenAI’s self-organizing conference on machine learning (SOCML 2016), meta-organized by Ian Goodfellow (thanks Ian!). It was held at OpenAI’s new office, with several floors of large open spaces. The unconference format was intended to encourage people to present current ideas alongside completed work. The schedule mostly consisted of 2-hour blocks with broad topics like “reinforcement learning” and “generative models”, guided by volunteer moderators. I especially enjoyed the sessions on neuroscience and AI and transfer learning, which had smaller and more manageable groups than the crowded popular sessions, and diligent moderators who wrote down the important points on the whiteboard. Overall, I had more interesting conversations but also more auditory overload at SOCML than at other conferences.

To my excitement, there was a block for AI safety along with the other topics. The safety session became a broad introductory Q&A, moderated by Nate Soares, Jelena Luketina and me. Some topics that came up: value alignment, interpretability, adversarial examples, weaponization of AI.


AI safety discussion group (image courtesy of Been Kim)

One value alignment question was how to incorporate a diverse set of values that represents all of humanity in the AI’s objective function. We pointed out that there are two complementary problems: 1) getting the AI’s values to be in the small part of values-space that’s human-compatible, and 2) averaging over that space in a representative way. People generally focus on the ways in which human values differ from each other, which leads them to underestimate the difficulty of the first problem and overestimate the difficulty of the second. We also agreed on the importance of allowing for moral progress by not locking in the values of AI systems.

Nate mentioned some alternatives to goal-optimizing agents – quantilizers and approval-directed agents. We also discussed the limitations of using blacklisting/whitelisting in the AI’s objective function: blacklisting is vulnerable to unforeseen shortcuts and usually doesn’t work from a security perspective, while whitelisting hampers the system’s ability to come up with creative solutions (e.g. the controversial move 37 by AlphaGo in the second game against Lee Sedol).
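To make the quantilizer idea concrete, here is a minimal sketch with a placeholder utility function and base distribution: instead of picking the single utility-maximizing action, sample from the top q fraction of actions drawn from a trusted base distribution, trading some utility for staying closer to known-safe behaviour.

```python
import random

def quantilize(sample_base_action, utility, q=0.1, n_samples=1000):
    """Minimal quantilizer sketch: draw candidate actions from a trusted base
    distribution, keep the top q fraction by estimated utility, and return one
    of those uniformly at random."""
    candidates = [sample_base_action() for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top_candidates = candidates[:max(1, int(q * n_samples))]
    return random.choice(top_candidates)
```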

Been Kim brought up the recent EU regulation on the right to explanation for algorithmic decisions. This seems easy to game due to lack of good metrics for explanations. One proposed metric was that a human would be able to predict future model outputs from the explanation. This might fail for better-than-human systems by penalizing creative solutions if applied globally, but seems promising as a local heuristic.

Ian Goodfellow mentioned the difficulties posed by adversarial examples: an imperceptible adversarial perturbation to an image can make a convolutional network misclassify it with very high confidence. There might be some kind of No Free Lunch theorem where making a system more resistant to adversarial examples would trade off with performance on non-adversarial data.

We also talked about dual-use AI technologies, e.g. advances in deep reinforcement learning for robotics that could end up being used for military purposes. It was unclear whether corporations or governments are more trustworthy with using these technologies ethically: corporations have a profit motive, while governments are more likely to weaponize the technology.


More detailed notes by Janos coming soon! For a detailed overview of technical AI safety research areas, I highly recommend reading Concrete Problems in AI Safety.

Cross-posted to the FLI blog.

Clopen AI: Openness in different aspects of AI development

There has been a lot of discussion about the appropriate level of openness in AI research in the past year – the OpenAI announcement, the blog post Should AI Be Open?, a response to the latter, and Nick Bostrom’s thorough paper Strategic Implications of Openness in AI Development.

There is disagreement on this question within the AI safety community as well as outside it. Many people are justifiably afraid of concentrating power to create AGI and determine its values in the hands of one company or organization. Many others are concerned about the information hazards of open-sourcing AGI and the resulting potential for misuse. In this post, I argue that some sort of compromise between openness and secrecy will be necessary, as both extremes of complete secrecy and complete openness seem really bad. The good news is that there isn’t a single axis of openness vs secrecy – we can make separate judgment calls for different aspects of AGI development, and develop a set of guidelines.

Information about AI development can be roughly divided into two categories – technical and strategic. Technical information includes research papers, data, source code (for the algorithm and objective function), etc. Strategic information includes goals, forecasts and timelines, the composition of ethics boards, etc. Bostrom argues that openness about strategic information is likely beneficial in terms of both short-term and long-term impact, while openness about technical information is good in the short term but can be bad in the long term, since it intensifies race dynamics. We need to further consider the tradeoffs of releasing different kinds of technical information.

Sharing papers and data is both more essential for the research process and less potentially dangerous than sharing code, since it is hard to reconstruct the code from that information alone. For example, it can be difficult to reproduce the results of a neural network algorithm based on the research paper, given the difficulty of tuning the hyperparameters and differences between computational architectures.

Releasing all the code required to run an AGI into the world, especially before it’s been extensively debugged, tested, and safeguarded against bad actors, would be extremely unsafe. Anyone with enough computational power could run the code, and it would be difficult to shut down the program or prevent it from copying itself all over the Internet.

However, releasing none of the source code is also a bad idea. It would currently be impractical, given the strong incentives for AI researchers to share at least part of the code for recognition and replicability. It would also be suboptimal, since sharing some parts of the code is likely to contribute to safety. For example, it would make sense to open-source the objective function code without the optimization code, which would reveal what is being optimized for but not how. This could make it possible to verify whether the objective is sufficiently representative of society’s values – and the objective is the part of the system that would be most understandable and important to the public anyway.
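Purely as an illustration of that kind of split (every name here is hypothetical, not anyone's actual codebase), the released portion might be a small, well-documented objective module, while the optimization machinery stays private:

```python
# objective.py -- the publicly released part (hypothetical illustration).

def objective(state):
    """Scalar value the system optimizes for a given world state.
    Publishing this reveals *what* is being optimized, so it can be audited
    against society's values, without revealing *how* it is optimized."""
    return (0.7 * state["human_approval"]
            + 0.3 * state["task_progress"]
            - 1.0 * state["estimated_harm"])

# trainer.py (kept private) would contain the optimization / search code
# that actually maximizes objective().
```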

It is rather difficult to verify to what extent a company or organization is sharing its technical information on AI development, or to enforce either complete openness or complete secrecy. There is not much downside to specifying guidelines for what is expected to be shared and what isn’t. Developing a joint set of openness guidelines for the short and long term would be a worthwhile endeavor for the leading AI companies today.

(Cross-posted to the FLI blog and Approximately Correct. Thanks to Jelena Luketina and Janos Kramar for their detailed feedback on this post!)

New AI safety research agenda from Google Brain

Google Brain just released an inspiring research agenda, Concrete Problems in AI Safety, co-authored by researchers from OpenAI, Berkeley and Stanford. This document is a milestone in setting concrete research objectives for keeping reinforcement learning agents and other AI systems robust and beneficial. The problems studied are relevant both to near-term and long-term AI safety, from cleaning robots to higher-stakes applications. The paper takes an empirical focus on avoiding accidents as modern machine learning systems become more and more autonomous and powerful.

Reinforcement learning is currently the most promising framework for building artificial agents – it is thus especially important to develop safety guidelines for this subfield of AI. The research agenda describes a comprehensive (though likely non-exhaustive) set of safety problems, corresponding to where things can go wrong when building AI systems:

  • Mis-specification of the objective function by the human designer. Two common pitfalls when designing objective functions are negative side-effects and reward hacking (also known as wireheading), which are likely to happen by default unless we figure out how to guard against them. One of the key challenges is specifying what it means for an agent to have a low impact on the environment while achieving its objectives effectively (a toy sketch of an impact-penalized objective appears after this list).

  • Extrapolation from limited information about the objective function. Even with a correct objective function, human supervision is likely to be costly, which calls for scalable oversight of the artificial agent.

  • Extrapolation from limited training data or using an inadequate model. We need to develop safe exploration strategies that avoid irreversibly bad outcomes, and build models that are robust to distributional shift – able to fail gracefully in situations that are far outside the training data distribution.
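For the low-impact challenge mentioned in the first item, one toy way to write it down is to subtract a penalty proportional to how far the agent's state diverges from some baseline, as in this simplified sketch; choosing the right baseline and distance measure is exactly the open problem.

```python
def impact_penalized_reward(task_reward, state, baseline_state,
                            distance, penalty_weight=1.0):
    """Toy sketch of an impact-penalized objective: reward for the task, minus a
    penalty for how much the agent has changed the world relative to a baseline
    (e.g. the state that would have occurred had the agent done nothing)."""
    return task_reward - penalty_weight * distance(state, baseline_state)
```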

The AI research community has been paying increasing attention to AI safety in recent years, and Google Brain’s agenda is part of this trend. It follows on the heels of the Safely Interruptible Agents paper from Google DeepMind and the Future of Humanity Institute, which investigates how to avoid unintended consequences from interrupting or shutting down reinforcement learning agents. We at FLI are super excited that industry research labs at Google and OpenAI are spearheading and fostering collaboration on AI safety research, and look forward to the outcomes of this work.

(Cross-posted from the FLI blog.)

Introductory resources on AI safety research

At a recent AI safety meetup, people asked for a reading list to get up to speed on the main ideas in the field. The resources below are selected for relevance and/or brevity, and the list is not meant to be comprehensive.

Motivation

For a popular audience:

FLI: AI risk background and FAQ. At the bottom of the background page, there is a more extensive list of resources on AI safety.

Tim Urban, Wait But Why: The AI Revolution. An accessible introduction to AI risk forecasts and arguments (with cute hand-drawn diagrams, and a few corrections from Luke Muehlhauser).

GiveWell: Potential risks from advanced artificial intelligence. An overview of AI risks and timelines, possible interventions, and current actors in this space.

Stuart Armstrong. Smarter Than Us: The Rise Of Machine Intelligence. A short ebook discussing potential promises and challenges presented by advanced AI, and the interdisciplinary problems that need to be solved on the way there.

For a more technical audience:

Stuart Russell:

  • The long-term future of AI (longer version). A video of Russell’s classic talk, discussing why it makes sense for AI researchers to think about AI safety, and going over various misconceptions about the issues.
  • Concerns of an AI pioneer. An interview with Russell on the importance of provably aligning AI with human values, and the challenges of value alignment research.
  • On Myths and Moonshine. Russell’s response to the “Myth of AI” question on Edge.org, which draws an analogy between AI research and nuclear research, and points out some dangers of optimizing a misspecified utility function.

Scott Alexander: No time like the present for AI safety work. An overview of long-term AI safety challenges, e.g. preventing wireheading and formalizing ethics.

Victoria Krakovna: AI risk without an intelligence explosion. An overview of long-term AI risks besides the (overemphasized) intelligence explosion / hard takeoff scenario, arguing why intelligence explosion skeptics should still think about AI safety.

Technical overviews

Amodei, Olah et al: Concrete Problems in AI Safety

Taylor et al (MIRI): Alignment for Advanced Machine Learning Systems

FLI: A survey of research priorities for robust and beneficial AI

MIRI: Aligning Superintelligence with Human Interests: A Technical Research Agenda

Jacob Steinhardt: Long-Term and Short-Term Challenges to Ensuring the Safety of AI Systems. A taxonomy of AI safety issues that require ordinary vs extraordinary engineering to address.

Nate Soares: Safety engineering, target selection, and alignment theory. Identifies and motivates three major areas of AI safety research.

Nick Bostrom: Superintelligence: Paths, Dangers, Strategies. A seminal book outlining long-term AI risk considerations.

Technical work

Steve Omohundro: The basic AI drives. Argues that sufficiently advanced AI systems are likely to develop drives such as self-preservation and resource acquisition independently of their assigned objectives.

Paul Christiano: AI control. A blog on designing safe, efficient AI systems (approval-directed agents, aligned reinforcement learning agents, etc).

MIRI: Corrigibility. Designing AI systems without incentives to resist corrective modifications by their creators.

Laurent Orseau: Wireheading. An investigation into how different types of artificial agents respond to wireheading opportunities (unintended shortcuts to maximize their objective function).

Collections of papers

MIRI publications

FHI publications

If there are any resources missing from this list that you think are a must-read, please let me know! If you want to go into AI safety research, check out these guidelines and the AI Safety Syllabus.

(Thanks to Ben Sancetta, Taymon Beal and Janos Kramar for their feedback on this post.)