(Coauthored with Ramana Kumar and cross-posted from the Alignment Forum.)
There are a few different classifications of safety problems, including the Specification, Robustness and Assurance (SRA) taxonomy and the Goodhart’s Law taxonomy. In SRA, the specification category is about defining the purpose of the system, i.e. specifying its incentives. Since incentive problems can be seen as manifestations of Goodhart’s Law, we explore how the specification category of the SRA taxonomy maps to the Goodhart taxonomy. The mapping is an attempt to integrate different breakdowns of the safety problem space into a coherent whole. We hope that a consistent classification of current safety problems will help develop solutions that are effective for entire classes of problems, including future problems that have not yet been identified.
The SRA taxonomy defines three different types of specifications of the agent’s objective: ideal (a perfect description of the wishes of the human designer), design (the stated objective of the agent) and revealed (the objective recovered from the agent’s behavior). It then divides specification problems into design problems (e.g. side effects) that correspond to a difference between the ideal and design specifications, and emergent problems (e.g. tampering) that correspond to a difference between the design and revealed specifications.
In the Goodhart taxonomy, there is a variable U* representing the true objective, and a variable U representing the proxy for the objective (e.g. a reward function). The taxonomy identifies four types of Goodhart effects: regressional (maximizing U also selects for the difference between U and U*), extremal (maximizing U takes the agent outside the region where U and U* are correlated), causal (the agent intervenes to maximize U in a way that does not affect U*), and adversarial (the agent has a different goal W and exploits the proxy U to maximize W).
We think there is a correspondence between these taxonomies: design problems are regressional and extremal Goodhart effects, while emergent problems are causal Goodhart effects. The rest of this post will explain and refine this correspondence.
This year the ICLR conference hosted topic-based workshops for the first time (as opposed to a single track for workshop papers), and I co-organized the Safe ML workshop. One of the main goals was to bring together near and long term safety research communities.
The workshop was structured according to a taxonomy that incorporates both near and long term safety research into three areas – specification, robustness and assurance.
|Specification: define the purpose of the system
||Robustness: design system to withstand perturbations
||Assurance: monitor and control system activity
- Reward hacking
- Side effects
- Preference learning
- Worst-case robustness
- Safe exploration
We had an invited talk and a contributed talk in each of the three areas.
Research / AI safety:
Rationality / effectiveness:
- Attended the CFAR mentoring workshop in Prague, and started running rationality training sessions with Janos at our group house.
- Started using work cycles – focused work blocks (e.g. pomodoros) with built-in reflection prompts. I think this has increased my productivity and focus to some degree. The prompt “how will I get started?” has been surprisingly helpful given its simplicity.
- Stopped eating processed sugar for health reasons at the end of 2017 and have been avoiding it ever since.
- This has been surprisingly easy, especially compared to my earlier attempts to eat less sugar. I think there are two factors behind this: avoiding sugar made everything taste sweeter (so many things that used to taste good now seem inedibly sweet), and the mindset shift from “this is a luxury that I shouldn’t indulge in” to “this is not food”.
- Unfortunately, I can’t make any conclusions about the effects on my mood variables because of some issues with my data recording process :(.
- Declining levels of insomnia (excluding jetlag):
- 22% of nights in the first half of 2017, 16% in the second half of 2017, 16% in the first half of 2018, 10% in the second half of 2018.
- This is probably an effect of the sleep CBT program I did in 2017, though avoiding sugar might be a factor as well.
- Made some progress on reducing non-research commitments (talks, reviewing, organizing, etc).
- Set up some systems for this: a spreadsheet to keep track of requests to do things (with 0-3 ratings for workload and 0-2 ratings for regret) and a form to fill out whenever I’m thinking of accepting a commitment.
- My overall acceptance rate for commitments has gone down a bit from 29% in 2017 to 24% in 2018. The average regret per commitment went down from 0.66 in 2017 to 0.53 in 2018.
- However, since the number of requests has gone up, I ended up with more things to do overall: 12 commitments with a total of 23 units of workload in 2017 vs 19 commitments with a total of 33 units of workload in 2018. (1 unit of workload ~ 5 hours)
At this year’s EA Global London conference, Jan Leike and I ran a discussion session on the machine learning approach to AI safety. We explored some of the assumptions and considerations that come up as we reflect on different research agendas. Slides for the discussion can be found here.
The discussion focused on two topics. The first topic examined assumptions made by the ML safety approach as a whole, based on the blog post Conceptual issues in AI safety: the paradigmatic gap. The second topic zoomed into specification problems, which both of us work on, and compared our approaches to these problems.
A major challenge in AI safety is reliably specifying human preferences to AI systems. An incorrect or incomplete specification of the objective can result in undesirable behavior like specification gaming or causing negative side effects. There are various ways to make the notion of a “side effect” more precise – I think of it as a disruption of the agent’s environment that is unnecessary for achieving its objective. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because the robot could have easily gone around the vase. On the other hand, a cooking robot that’s making an omelette has to break some eggs, so breaking eggs is not a side effect.
(image credits: 1, 2, 3)
How can we measure side effects in a general way that’s not tailored to particular environments or tasks, and incentivize the agent to avoid them? This is the central question of our latest paper.
Various examples (and lists of examples) of unintended behaviors in AI systems have appeared in recent years. One interesting type of unintended behavior is finding a way to game the specified objective: generating a solution that literally satisfies the stated objective but fails to solve the problem according to the human designer’s intent. This occurs when the objective is poorly specified, and includes reinforcement learning agents hacking the reward function, evolutionary algorithms gaming the fitness function, etc.
While ‘specification gaming’ is a somewhat vague category, it is particularly referring to behaviors that are clearly hacks, not just suboptimal solutions. A classic example is OpenAI’s demo of a reinforcement learning agent in a boat racing game going in circles and repeatedly hitting the same reward targets instead of actually playing the game.
Since such examples are currently scattered across several lists, I have put together a master list of examples collected from the various existing sources. This list is intended to be comprehensive and up-to-date, and serve as a resource for AI safety research and discussion. If you know of any interesting examples of specification gaming that are missing from the list, please submit them through this form.
Thanks to Gwern Branwen, Catherine Olsson, Alex Irpan, and others for collecting and contributing examples!
Something I often hear in the machine learning community and media articles is “Worries about superintelligence are a distraction from the *real* problem X that we are facing today with AI” (where X = algorithmic bias, technological unemployment, interpretability, data privacy, etc). This competitive attitude gives the impression that immediate and longer-term safety concerns are in conflict. But is there actually a tradeoff between them?
We can make this question more specific: what resources might these two types of efforts be competing for?