Various examples (and lists of examples) of unintended behaviors in AI systems have appeared in recent years. One interesting type of unintended behavior is finding a way to game the specified objective: generating a solution that literally satisfies the stated objective but fails to solve the problem according to the human designer’s intent. This occurs when the objective is poorly specified, and includes reinforcement learning agents hacking the reward function, evolutionary algorithms gaming the fitness function, etc. While ‘specification gaming’ is a somewhat vague category, it refers specifically to behaviors that are clearly hacks, not just suboptimal solutions.
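To make the definition above concrete, here is a toy sketch in Python. Everything in it is made up for illustration (the track, the checkpoint reward, the horizon) and is not taken from any example on the list: a designer wants an agent to reach the end of a short track and rewards it for entering checkpoints along the way, assuming that collecting checkpoints means making progress. A brute-force search over action sequences stands in for the optimizer.

```python
from itertools import product

# Hypothetical toy environment, invented for this illustration.
TRACK_END = 4            # finish line; reaching it ends the episode
CHECKPOINTS = {1, 2, 3}  # entering any of these pays +1, every time
HORIZON = 10             # maximum number of moves per episode


def run(actions):
    """Return (specified reward, finished?) for a sequence of -1/+1 moves."""
    pos, reward = 0, 0
    for step in actions:
        pos = max(0, pos + step)  # the track starts at position 0
        if pos == TRACK_END:
            return reward, True   # intended outcome: the agent finishes
        if pos in CHECKPOINTS:
            reward += 1           # the proxy objective the designer wrote
    return reward, False


# The behavior the designer had in mind: head straight for the finish.
intended = run([+1] * HORIZON)

# What an optimizer of the specified reward finds instead.
best_actions = max(product((-1, +1), repeat=HORIZON), key=lambda a: run(a)[0])
gamed = run(best_actions)

print("intended:", intended)  # (3, True)
print("gamed:   ", gamed)     # (10, False)
```

The highest-scoring sequence shuttles back and forth between checkpoints and never finishes. It literally maximizes the stated objective while ignoring the designer's intent, which is what separates a hack from a merely suboptimal solution.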
Since these examples are currently scattered across several lists, I have put together a master list of examples collected from the various existing sources. This list is intended to be comprehensive and up-to-date, and to serve as a resource for AI safety research and discussion. If you know of any interesting examples of specification gaming that are missing from the list, please submit them through this form.
Thanks to Gwern Branwen, Catherine Olsson, Alex Irpan, and others for collecting and contributing examples!
Something I often hear in the machine learning community and media articles is “Worries about superintelligence are a distraction from the *real* problem X that we are facing today with AI” (where X = algorithmic bias, technological unemployment, interpretability, data privacy, etc). This competitive attitude gives the impression that immediate and longer-term safety concerns are in conflict. But is there actually a tradeoff between them?
We can make this question more specific: what resources might these two types of efforts be competing for?
This year’s NIPS gave me a general sense that near-term AI safety is now mainstream and long-term safety is slowly going mainstream. On the near-term side, I particularly enjoyed Kate Crawford’s keynote on neglected problems in AI fairness, the ML security workshops, and the Interpretable ML symposium debate that addressed the “do we even need interpretability?” question in a somewhat sloppy but entertaining way. There was a lot of great content on the long-term side, including several oral / spotlight presentations and the Aligned AI workshop.
I just spent a week in Japan to speak at the inaugural symposium on AI & Society – my first conference in Asia. It was inspiring to take part in an increasingly global conversation about AI impacts, and interesting to see how the Japanese AI community thinks about these issues. Overall, Japanese researchers seemed more open to discussing controversial topics like human-level AI and consciousness than their Western counterparts. Most people were more interested in near-term AI ethics concerns but were also curious about long-term problems.
The talks were a mix of English and Japanese with translation available over audio (high quality but still hard to follow when the slides are in Japanese). Here are some tidbits from my favorite talks and sessions.
Long-term AI safety is an inherently speculative research area, aiming to ensure the safety of advanced future systems despite uncertainty about their design, algorithms, or objectives. It thus seems particularly important to have different research teams tackle the problems from different perspectives and under different assumptions. While some fraction of the research might not end up being useful, a portfolio approach makes it more likely that at least some of us will be right.
In this post, I look at some dimensions along which assumptions differ, and identify some underexplored reasonable assumptions that might be relevant for prioritizing safety research. (In the interest of making this breakdown as comprehensive and useful as possible, please let me know if I got something wrong or missed anything important.)
I’ve been collecting data about myself on a daily basis for the past 3 years. Half a year ago, I switched from using 42goals (which I only remembered to fill out once every few days) to a Google form emailed to me daily (which I fill out consistently because I check email often). Now for the moment of truth – a correlation matrix!
The data consists of “mood variables” (anxiety, tiredness, and “zoneout” – how distracted / spacey I’m feeling), “action variables” (exercise and meditation) and sleep variables (hours of sleep, sleep start/end time, insomnia). There are 5 binary variables (meditation, exercise, evening/morning insomnia, headache) and the rest are ordinal or continuous. Almost all the variables have 6 months of data, except that I started tracking anxiety 5 months ago and zoneout 2 months ago.
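For readers who want to try something similar, here is a minimal sketch of how a correlation matrix could be computed from a daily log like this. It is not the code I used, the variable names and numbers are invented, and it assumes plain Pearson correlation. Because some variables were tracked for fewer months than others, each pair is correlated only over the days where both values are present.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation over the days where both values are recorded."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None  # not enough overlapping days
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each pair of variable names to their pairwise correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up sample log: one entry per day, None where a variable was not yet tracked.
log = {
    "sleep_hours": [7.5, 6.0, 8.0, 5.5, 7.0, 6.5],
    "exercise": [1, 0, 1, 0, 1, 0],    # binary variable
    "anxiety": [None, 3, 1, 4, 2, 3],  # ordinal, tracking started a day later
}

matrix = correlation_matrix(log)
for name, row in matrix.items():
    print(name, {k: None if v is None else round(v, 2) for k, v in row.items()})
```

With pandas installed, `pandas.DataFrame(log).corr()` gives the same kind of matrix and also skips missing values pair by pair. Pearson correlation treats ordinal scores as if they were evenly spaced, so a rank-based measure such as Spearman may be a better fit for the mood variables.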