Various examples (and lists of examples) of unintended behaviors in AI systems have appeared in recent years. One interesting type of unintended behavior is finding a way to game the specified objective: generating a solution that literally satisfies the stated objective but fails to solve the problem according to the human designer’s intent. This occurs when the objective is poorly specified, and includes reinforcement learning agents hacking the reward function, evolutionary algorithms gaming the fitness function, etc. While ‘specification gaming’ is a somewhat vague category, it is particularly referring to behaviors that are clearly hacks, not just suboptimal solutions.
Since these examples are currently scattered across several lists, I have put together a master list of examples collected from the various existing sources. This list is intended to be comprehensive and up-to-date, and serve as a resource for AI safety research and discussion. If you know of any interesting examples of specification gaming that are missing from the list, please submit them through this form.
Thanks to Gwern Branwen, Catherine Olsson, Alex Irpan, and others for collecting and contributing examples!
This year’s NIPS gave me a general sense that near-term AI safety is now mainstream and long-term safety is slowly going mainstream. On the near-term side, I particularly enjoyed Kate Crawford’s keynote on neglected problems in AI fairness, the ML security workshops, and the Interpretable ML symposium debate that addressed the “do we even need interpretability?” question in a somewhat sloppy but entertaining way. There was a lot of great content on the long-term side, including several oral / spotlight presentations and the Aligned AI workshop.
Long-term AI safety is an inherently speculative research area, aiming to ensure safety of advanced future systems despite uncertainty about their design or algorithms or objectives. It thus seems particularly important to have different research teams tackle the problems from different perspectives and under different assumptions. While some fraction of the research might not end up being useful, a portfolio approach makes it more likely that at least some of us will be right.
In this post, I look at some dimensions along which assumptions differ, and identify some underexplored reasonable assumptions that might be relevant for prioritizing safety research. (In the interest of making this breakdown as comprehensive and useful as possible, please let me know if I got something wrong or missed anything important.)
It’s been an eventful few days at ICLR in the coastal town of Toulon in Southern France, after a pleasant train ride from London with a stopover in Paris for some sightseeing. There was more food than is usually provided at conferences, and I ended up almost entirely subsisting on tasty appetizers. The parties were memorable this year, including one in a vineyard and one in a naval museum. The overall theme of the conference setting could be summarized as “finger food and ships”.
There were a lot of interesting papers this year, especially on machine learning security, which will be the focus on this post. (Here is a great overview of the topic.)
This year’s Neural Information Processing Systems conference was larger than ever, with almost 6000 people attending, hosted in a huge convention center in Barcelona, Spain. The conference started off with two exciting announcements on open-sourcing collections of environments for training and testing general AI capabilities – the DeepMind Lab and the OpenAI Universe. Among other things, this is promising for testing safety properties of ML algorithms. OpenAI has already used their Universe environment to give an entertaining and instructive demonstration of reward hacking that illustrates the challenge of designing robust reward functions.
I was happy to see a lot of AI-safety-related content at NIPS this year. The ML and the Law symposium and Interpretable ML for Complex Systems workshop focused on near-term AI safety issues, while the Reliable ML in the Wild workshop also covered long-term problems. Here are some papers relevant to long-term AI safety:
Last weekend, I attended OpenAI’s self-organizing conference on machine learning (SOCML 2016), meta-organized by Ian Goodfellow (thanks Ian!). It was held at OpenAI’s new office, with several floors of large open spaces. The unconference format was intended to encourage people to present current ideas alongside with completed work. The schedule mostly consisted of 2-hour blocks with broad topics like “reinforcement learning” and “generative models”, guided by volunteer moderators. I especially enjoyed the sessions on neuroscience and AI and transfer learning, which had smaller and more manageable groups than the crowded popular sessions, and diligent moderators who wrote down the important points on the whiteboard. Overall, I had more interesting conversation but also more auditory overload at SOCML than at other conferences.
To my excitement, there was a block for AI safety along with the other topics. The safety session became a broad introductory Q&A, moderated by Nate Soares, Jelena Luketina and me. Some topics that came up: value alignment, interpretability, adversarial examples, weaponization of AI.
AI safety discussion group (image courtesy of Been Kim)