This year the ICLR conference hosted topic-based workshops for the first time (as opposed to a single track for workshop papers), and I co-organized the Safe ML workshop. One of the main goals was to bring together the near-term and long-term safety research communities.
The workshop was structured around a taxonomy that organizes both near-term and long-term safety research into three areas: specification, robustness, and assurance.
- Specification: define the purpose of the system
- Robustness: design the system to withstand perturbations
- Assurance: monitor and control system activity
- Reward hacking
- Side effects
- Preference learning
- Worst-case robustness
- Safe exploration
We had an invited talk and a contributed talk in each of the three areas.
Research / AI safety:
Rationality / effectiveness:
- Attended the CFAR mentoring workshop in Prague, and started running rationality training sessions with Janos at our group house.
- Started using work cycles – focused work blocks (e.g. pomodoros) with built-in reflection prompts. I think this has increased my productivity and focus to some degree. The prompt “how will I get started?” has been surprisingly helpful given its simplicity.
- Stopped eating processed sugar for health reasons at the end of 2017 and have been avoiding it ever since.
- This has been surprisingly easy, especially compared to my earlier attempts to eat less sugar. I think there are two factors behind this: avoiding sugar made everything taste sweeter (so many things that used to taste good now seem inedibly sweet), and the mindset shift from “this is a luxury that I shouldn’t indulge in” to “this is not food”.
- Unfortunately, I can’t draw any conclusions about the effects on my mood variables because of some issues with my data recording process :(.
- Declining levels of insomnia (excluding jetlag):
- 22% of nights in the first half of 2017, 16% in the second half of 2017, 16% in the first half of 2018, 10% in the second half of 2018.
- This is probably an effect of the sleep CBT program I did in 2017, though avoiding sugar might be a factor as well.
- Made some progress on reducing non-research commitments (talks, reviewing, organizing, etc.).
- Set up some systems for this: a spreadsheet to keep track of requests to do things (with 0-3 ratings for workload and 0-2 ratings for regret) and a form to fill out whenever I’m thinking of accepting a commitment.
- My overall acceptance rate for commitments has gone down a bit from 29% in 2017 to 24% in 2018. The average regret per commitment went down from 0.66 in 2017 to 0.53 in 2018.
- However, since the number of requests has gone up, I ended up with more things to do overall: 12 commitments with a total of 23 units of workload in 2017 vs 19 commitments with a total of 33 units of workload in 2018. (1 unit of workload ~ 5 hours)
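The bookkeeping above can be sketched in a few lines of code. This is a toy illustration of the spreadsheet's calculations, not the actual system; the field names and example records are hypothetical.

```python
# Toy sketch of the commitment-tracking spreadsheet described above.
# Field names and example records are hypothetical, not the actual data.

def commitment_stats(requests):
    """Compute acceptance rate, total workload, and average regret.

    Each request is a dict with:
      accepted - whether the commitment was taken on
      workload - 0-3 rating (1 unit of workload ~ 5 hours)
      regret   - 0-2 rating, recorded for accepted commitments
    """
    accepted = [r for r in requests if r["accepted"]]
    rate = len(accepted) / len(requests)
    workload = sum(r["workload"] for r in accepted)
    avg_regret = sum(r["regret"] for r in accepted) / len(accepted)
    return rate, workload, avg_regret

# Hypothetical example: four requests, two accepted.
requests = [
    {"accepted": True,  "workload": 2, "regret": 1},
    {"accepted": True,  "workload": 1, "regret": 0},
    {"accepted": False, "workload": 0, "regret": 0},
    {"accepted": False, "workload": 0, "regret": 0},
]
rate, workload, avg_regret = commitment_stats(requests)
# rate = 0.5, workload = 3, avg_regret = 0.5
```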
At this year’s EA Global London conference, Jan Leike and I ran a discussion session on the machine learning approach to AI safety. We explored some of the assumptions and considerations that come up as we reflect on different research agendas. Slides for the discussion can be found here.
The discussion focused on two topics. The first topic examined assumptions made by the ML safety approach as a whole, based on the blog post Conceptual issues in AI safety: the paradigmatic gap. The second topic zoomed into specification problems, which both of us work on, and compared our approaches to these problems.
A major challenge in AI safety is reliably specifying human preferences to AI systems. An incorrect or incomplete specification of the objective can result in undesirable behavior like specification gaming or causing negative side effects. There are various ways to make the notion of a “side effect” more precise – I think of it as a disruption of the agent’s environment that is unnecessary for achieving its objective. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because the robot could have easily gone around the vase. On the other hand, a cooking robot that’s making an omelette has to break some eggs, so breaking eggs is not a side effect.
(image credits: 1, 2, 3)
How can we measure side effects in a general way that’s not tailored to particular environments or tasks, and incentivize the agent to avoid them? This is the central question of our latest paper.
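One simple way to make this concrete is to penalize the agent for deviations from a baseline state (e.g. the state the environment would be in if the agent did nothing). The sketch below is a deliberately naive illustration of that idea, not the measure proposed in the paper; the state representation and function names are made up for the example.

```python
# Naive side-effect penalty: augment the task reward with a penalty for
# differences between the current state and a baseline state. This is an
# illustration only; the paper's actual measure is more sophisticated.

def penalized_reward(task_reward, state, baseline, beta=1.0):
    """States are modeled as sets of facts, e.g. {"vase_intact"}.

    The penalty counts facts that differ between state and baseline,
    scaled by the coefficient beta.
    """
    deviation = len(state.symmetric_difference(baseline))
    return task_reward - beta * deviation

baseline = {"vase_intact", "eggs_whole"}

# The box-carrying robot that breaks the vase deviates unnecessarily:
r_break = penalized_reward(10, {"eggs_whole"}, baseline)                 # 10 - 1 = 9
r_avoid = penalized_reward(10, {"vase_intact", "eggs_whole"}, baseline)  # 10 - 0 = 10
```

Note that this naive penalty would equally punish the cooking robot for breaking eggs, even though that effect is necessary for the omelette; distinguishing necessary from unnecessary effects is part of what makes the problem hard.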
Various examples (and lists of examples) of unintended behaviors in AI systems have appeared in recent years. One interesting type of unintended behavior is finding a way to game the specified objective: generating a solution that literally satisfies the stated objective but fails to solve the problem according to the human designer’s intent. This occurs when the objective is poorly specified, and includes reinforcement learning agents hacking the reward function, evolutionary algorithms gaming the fitness function, etc.
While ‘specification gaming’ is a somewhat vague category, it refers specifically to behaviors that are clearly hacks rather than merely suboptimal solutions. A classic example is OpenAI’s demo of a reinforcement learning agent in a boat racing game going in circles and repeatedly hitting the same reward targets instead of actually playing the game.
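A back-of-the-envelope calculation shows why an agent maximizing the stated reward can prefer the hack. All numbers below are made up for illustration: suppose hitting a reward target gives +1 and can be repeated every step, while finishing the race gives a one-off +10.

```python
# Toy illustration of reward hacking in the boat-racing setting.
# Reward values and the discount factor are hypothetical.

def looping_return(reward_per_step, gamma, steps):
    # Discounted return for circling and hitting a target every step.
    return sum(reward_per_step * gamma**t for t in range(steps))

gamma = 0.95
loop_value = looping_return(1.0, gamma, steps=1000)  # approaches 1/(1 - gamma) = 20
finish_value = 10.0                                  # one-off bonus for finishing

# Under the stated reward, looping dominates finishing the race:
# loop_value (~20) > finish_value (10)
```

So the circling behavior is not a bug in the learning algorithm: the agent is correctly maximizing the objective it was given, which is what makes the misspecification the real problem.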
Since such examples are currently scattered across several lists, I have put together a master list of examples collected from the various existing sources. This list is intended to be comprehensive and up-to-date, and serve as a resource for AI safety research and discussion. If you know of any interesting examples of specification gaming that are missing from the list, please submit them through this form.
Thanks to Gwern Branwen, Catherine Olsson, Alex Irpan, and others for collecting and contributing examples!
Something I often hear in the machine learning community and media articles is “Worries about superintelligence are a distraction from the *real* problem X that we are facing today with AI” (where X = algorithmic bias, technological unemployment, interpretability, data privacy, etc). This competitive attitude gives the impression that immediate and longer-term safety concerns are in conflict. But is there actually a tradeoff between them?
We can make this question more specific: what resources might these two types of efforts be competing for?