(Coauthored with Ramana Kumar and cross-posted from the Alignment Forum.)
There are a few different classifications of safety problems, including the Specification, Robustness and Assurance (SRA) taxonomy and the Goodhart’s Law taxonomy. In SRA, the specification category is about defining the purpose of the system, i.e. specifying its incentives. Since incentive problems can be seen as manifestations of Goodhart’s Law, we explore how the specification category of the SRA taxonomy maps to the Goodhart taxonomy. The mapping is an attempt to integrate different breakdowns of the safety problem space into a coherent whole. We hope that a consistent classification of current safety problems will help develop solutions that are effective for entire classes of problems, including future problems that have not yet been identified.
The SRA taxonomy defines three different types of specifications of the agent’s objective: ideal (a perfect description of the wishes of the human designer), design (the stated objective of the agent) and revealed (the objective recovered from the agent’s behavior). It then divides specification problems into design problems (e.g. side effects) that correspond to a difference between the ideal and design specifications, and emergent problems (e.g. tampering) that correspond to a difference between the design and revealed specifications.
In the Goodhart taxonomy, there is a variable U* representing the true objective, and a variable U representing the proxy for the objective (e.g. a reward function). The taxonomy identifies four types of Goodhart effects: regressional (maximizing U also selects for the difference between U and U*), extremal (maximizing U takes the agent outside the region where U and U* are correlated), causal (the agent intervenes to maximize U in a way that does not affect U*), and adversarial (the agent has a different goal W and exploits the proxy U to maximize W).
We think there is a correspondence between these taxonomies: design problems are regressional and extremal Goodhart effects, while emergent problems are causal Goodhart effects. The rest of this post will explain and refine this correspondence.
At this year’s EA Global London conference, Jan Leike and I ran a discussion session on the machine learning approach to AI safety. We explored some of the assumptions and considerations that come up as we reflect on different research agendas. Slides for the discussion can be found here.
The discussion focused on two topics. The first topic examined assumptions made by the ML safety approach as a whole, based on the blog post Conceptual issues in AI safety: the paradigmatic gap. The second topic zoomed into specification problems, which both of us work on, and compared our approaches to these problems.
Something I often hear in the machine learning community and media articles is “Worries about superintelligence are a distraction from the *real* problem X that we are facing today with AI” (where X = algorithmic bias, technological unemployment, interpretability, data privacy, etc). This competitive attitude gives the impression that immediate and longer-term safety concerns are in conflict. But is there actually a tradeoff between them?
We can make this question more specific: what resources might these two types of efforts be competing for?
Long-term AI safety is an inherently speculative research area, aiming to ensure safety of advanced future systems despite uncertainty about their design or algorithms or objectives. It thus seems particularly important to have different research teams tackle the problems from different perspectives and under different assumptions. While some fraction of the research might not end up being useful, a portfolio approach makes it more likely that at least some of us will be right.
In this post, I look at some dimensions along which assumptions differ, and identify some underexplored reasonable assumptions that might be relevant for prioritizing safety research. (In the interest of making this breakdown as comprehensive and useful as possible, please let me know if I got something wrong or missed anything important.)
There has been a lot of discussion about the appropriate level of openness in AI research in the past year – the OpenAI announcement, the blog post Should AI Be Open?, a response to the latter, and Nick Bostrom’s thorough paper Strategic Implications of Openness in AI development.
There is disagreement on this question within the AI safety community as well as outside it. Many people are justifiably afraid of concentrating power to create AGI and determine its values in the hands of one company or organization. Many others are concerned about the information hazards of open-sourcing AGI and the resulting potential for misuse. In this post, I argue that some sort of compromise between openness and secrecy will be necessary, as both extremes of complete secrecy and complete openness seem really bad. The good news is that there isn’t a single axis of openness vs secrecy – we can make separate judgment calls for different aspects of AGI development, and develop a set of guidelines.
Among those concerned about risks from advanced AI, I’ve encountered people who would be interested in a career in AI research, but are worried that doing so would speed up AI capability relative to safety. I think it is a mistake for AI safety proponents to avoid going into the field for this reason (better reasons include being well-positioned to do AI safety work, e.g. at MIRI or FHI). This mistake contributed to me choosing statistics rather than computer science for my PhD, which I have some regrets about, though luckily there is enough overlap between the two fields that I can work on machine learning anyway.
I think the value of having more AI experts who are worried about AI safety is far higher than the downside of adding a few drops to the ocean of people trying to advance AI. Here are several reasons for this:
- Concerned researchers can inform and influence their colleagues, especially if they are outspoken about their views.
- Studying and working on AI brings understanding of the current challenges and breakthroughs in the field, which can usefully inform AI safety work (e.g. wireheading in reinforcement learning agents).
- Opportunities to work on AI safety are beginning to spring up within academia and industry, e.g. through FLI grants. In the next few years, it will be possible to do an AI-safety-focused PhD or postdoc in computer science, which would hit two birds with one stone.
“An ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind.”
– Computer scientist I. J. Good, 1965
Artificial intelligence systems we have today can be referred to as narrow AI – they perform well at specific tasks, like playing chess or Jeopardy, and some classes of problems like Atari games. Many experts predict that general AI, which would be able to perform most tasks humans can, will be developed later this century, with median estimates around 2050. When people talk about long term existential risk from the development of general AI, they commonly refer to the intelligence explosion (IE) scenario. AI risk skeptics often argue against AI safety concerns along the lines of “Intelligence explosion sounds like science-fiction and seems really unlikely, therefore there’s not much to worry about”. It’s unfortunate when AI safety concerns are rounded down to worries about IE. Unlike I. J. Good, I do not consider this scenario inevitable (though relatively likely), and I would expect general AI to present an existential risk even if I knew for sure that intelligence explosion were impossible.